To appear as a part of an upcoming textbook on dimensionality reduction and manifold learning.

Restricted Boltzmann Machine and Deep Belief Network: Tutorial and Survey

Benyamin Ghojogh [email protected] Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Ali Ghodsi [email protected] Department of Statistics and Actuarial Science & David R. Cheriton School of Computer Science, Data Analytics Laboratory, University of Waterloo, Waterloo, ON, Canada

Fakhri Karray [email protected] Department of Electrical and Computer Engineering, Centre for Pattern Analysis and Machine Intelligence, University of Waterloo, Waterloo, ON, Canada

Mark Crowley [email protected] Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

arXiv:2107.12521v1 [cs.LG] 26 Jul 2021

Abstract

This is a tutorial and survey paper on Boltzmann Machine (BM), Restricted Boltzmann Machine (RBM), and Deep Belief Network (DBN). We start with the required background on probabilistic graphical models, Markov random fields, Gibbs sampling, statistical physics, the Ising model, and the Hopfield network. Then, we introduce the structures of BM and RBM. The conditional distributions of visible and hidden variables, Gibbs sampling in RBM for generating variables, training BM and RBM by maximum likelihood estimation, and contrastive divergence are explained. Then, we discuss different possible discrete and continuous distributions for the variables. We introduce conditional RBM and how it is trained. Finally, we explain deep belief network as a stack of RBM models. This paper on Boltzmann machines can be useful in various fields including data science, statistics, neural computation, and statistical physics.

1. Introduction

Centuries ago, the Boltzmann distribution (Boltzmann, 1868), also called the Gibbs distribution (Gibbs, 1902), was proposed. This energy-based distribution was found to be useful for modeling physical systems statistically (Huang, 1987). One of these systems was the Ising model, which modeled interacting particles with binary spins (Lenz, 1920; Ising, 1925). Later, the Ising model was found to be representable as a neural network (Little, 1974). Hence, the Hopfield network was proposed, which modeled an Ising model in a network for modeling memory (Hopfield, 1982). Inspired by the Hopfield network (Little, 1974; Hopfield, 1982), which was itself inspired by the physical Ising model (Lenz, 1920; Ising, 1925), Hinton et al. proposed the Boltzmann Machine (BM) and the Restricted Boltzmann Machine (RBM) (Hinton & Sejnowski, 1983; Ackley et al., 1985). These models are energy-based models (LeCun et al., 2006), and their names come from the Boltzmann distribution (Boltzmann, 1868; Gibbs, 1902) used in them. A BM has weighted links between two layers of neurons as well as links between the neurons of every layer. RBM restricts these links so that there are no links between the neurons of a layer. BM and RBM take one of the layers as the layer of data and the other layer as a representation or embedding of data. BM and RBM are special cases of the Ising model whose weights (coupling parameters) are learned. BM and RBM are also special cases of the Hopfield network whose weights are learned by maximum likelihood estimation rather than the Hebbian learning method (Hebb, 1949) which is used in the Hopfield network.

The Hebbian learning method, used in the Hopfield network, was very weak and could not generalize well to unseen data. Therefore, backpropagation (Rumelhart et al., 1986) was proposed for training neural networks. Backpropagation was gradient descent plus the chain rule technique. However, researchers soon found out that neural networks cannot get deep in their number of layers. This is because, in deep networks, gradients become very small in the initial layers after many chain rules from the last layers of the network. This problem was called vanishing gradients. This problem of networks, plus the glory of theory in kernel support vector machines (Boser et al., 1992), resulted in the winter of neural networks in the last years of the previous century until around 2006.

During the winter of neural networks, Hinton tried to save neural networks from being forgotten in the history of machine learning. So, he returned to his previously proposed RBM and proposed a learning method for RBM with the help of some other researchers including Max Welling (Hinton, 2002; Welling et al., 2004). They proposed training the weights of BM and RBM using maximum likelihood estimation. BM and RBM can be seen as generative models where new values for neurons can be generated using Gibbs sampling (Geman & Geman, 1984). Hinton noticed RBM because he knew that the set of weights between every two layers of a neural network is an RBM. It was in the year 2006 (Hinton & Salakhutdinov, 2006; Hinton et al., 2006) that he thought it is possible to train a network in a greedy way (Bengio et al., 2007) where the weights of every layer of the network are trained using RBM training. This stack of RBM models with a greedy algorithm for training was named the Deep Belief Network (DBN) (Hinton et al., 2006; Hinton, 2009). DBN allowed the networks to become deep by preparing a good initialization of weights (using RBM training) for backpropagation. This good starting point for backpropagation optimization did not face the problem of vanishing gradients anymore. Since the breakthrough in 2006 (Hinton & Salakhutdinov, 2006), the winter of neural networks started to end gradually because the networks could get deep to become more nonlinear and handle more nonlinear data.

DBN was used in different applications including speech recognition (Mohamed et al., 2009; Mohamed & Hinton, 2010; Mohamed et al., 2011) and action recognition (Taylor et al., 2007). Hinton was very excited about the success of RBM and was thinking that the future of neural networks belongs to DBN. However, his research group proposed two important techniques, the ReLU activation function (Glorot et al., 2011) and the dropout technique (Srivastava et al., 2014). These two regularization methods prevented overfitting (Ghojogh & Crowley, 2019) and resolved vanishing gradients even without RBM pre-training. Hence, backpropagation could be used alone if the new regularization methods were utilized. The success of neural networks was demonstrated further (LeCun et al., 2015) by its various applications, for example in image recognition (Karpathy & Fei-Fei, 2015).

This is a tutorial and survey paper on BM, RBM, and DBN. The remainder of this paper is as follows. We briefly review the required background on probabilistic graphical models, Markov random fields, Gibbs sampling, statistical physics, the Ising model, and the Hopfield network in Section 2. The structure of BM and RBM, Gibbs sampling in RBM, training RBM by maximum likelihood estimation, contrastive divergence, and training BM are explained in Section 3. Then, we introduce different cases of states for units in RBM in Section 4. Conditional RBM and DBN are explained in Sections 5 and 6, respectively. Finally, Section 7 concludes the paper.

Required Background for the Reader

This paper assumes that the reader has general knowledge of calculus, probability, linear algebra, and basics of optimization. The required background on statistical physics is explained in the paper.

2. Background

2.1. Probabilistic Graphical Model and Markov Random Field

A Probabilistic Graphical Model (PGM) is a graph-based representation of a complex distribution in a possibly high dimensional space (Koller & Friedman, 2009). In other words, PGM is a combination of graph theory and probability theory. In a PGM, the random variables are represented by nodes or vertices. There exist edges between two variables which interact with one another in terms of probability. Different conditional probabilities can be represented by a PGM. There exist two types of PGM, which are the Markov network (also called Markov random field) and the Bayesian network (Koller & Friedman, 2009). In the Markov network and Bayesian network, the edges of the graph are undirected and directed, respectively. BM and RBM are Markov networks (Markov random fields) because their links are undirected (Hinton, 2007).

2.2. Gibbs Sampling

Gibbs sampling, first proposed by (Geman & Geman, 1984), draws samples from a d-dimensional multivariate distribution P(X) using d conditional distributions (Bishop, 2006). This sampling algorithm assumes that the conditional distributions of every dimension of data conditioned on the rest of the coordinates are simple to draw samples from.

In Gibbs sampling, we desire to sample from a multivariate distribution P(X) where X ∈ R^d. Consider the notation R^d ∋ x := [x_1, x_2, ..., x_d]^⊤. We start from a random d-dimensional vector in the range of data. Then, we sample the first dimension of the first sample from the distribution of the first dimension conditioned on the other dimensions. We do it for all dimensions, where the j-th dimension is sampled as (Ghojogh et al., 2020):

    x_j ∼ P(x_j | x_1, ..., x_{j−1}, x_{j+1}, ..., x_d).    (1)

We do this for all dimensions until all dimensions of the first sample are drawn. Then, starting from the first sample, we repeat this procedure for the dimensions of the second sample. We iteratively perform this for all samples; however, some initial samples are not yet valid because the algorithm has started from a not-necessarily valid vector. We accept all samples after some burn-in iterations. Gibbs sampling can be seen as a special case of the Metropolis-Hastings algorithm which accepts the proposed samples with probability one (see (Ghojogh et al., 2020) for proof). Gibbs sampling is used in BM and RBM for generating visible and hidden samples.

2.3. Statistical Physics and Ising Model

2.3.1. Boltzmann (Gibbs) Distribution

Assume we have several particles {x_i}_{i=1}^d in statistical physics. These particles can be seen as random variables which can randomly have a state. For example, if the particles are electrons, they can have states +1 and −1 for counterclockwise and clockwise spins, respectively. The Boltzmann distribution (Boltzmann, 1868), also called the Gibbs distribution (Gibbs, 1902), can show the probability that a physical system can have a specific state, i.e., every one of the particles has a specific state. The probability mass function of this distribution is (Huang, 1987):

    P(x) = e^{−βE(x)} / Z,    (2)

where E(x) is the energy of variable x and Z is the normalization constant so that the probabilities sum to one. This normalization constant is called the partition function, which is hard to compute as it sums over all possible configurations of states (values) that the particles can have. If we define R^d ∋ x := [x_1, ..., x_d]^⊤, we have:

    Z := Σ_{x ∈ R^d} e^{−βE(x)}.    (3)

The coefficient β ≥ 0 is defined as:

    β := 1 / (k_B T) ∝ 1/T,    (4)

where k_B is the Boltzmann constant and T ≥ 0 is the absolute thermodynamic temperature in Kelvins. If the temperature tends to absolute zero, T → 0, we have β → ∞ and P(x) → 0, meaning that the absolute zero temperature occurs extremely rarely in the universe.

The free energy is defined as:

    F(β) := (−1/β) ln(Z),    (5)

where ln(·) is the natural logarithm. The internal energy is defined as:

    U(β) := (∂/∂β) (β F(β)).    (6)

Therefore, we have:

    U(β) = (∂/∂β) (−ln(Z)) = (−1/Z) (∂Z/∂β)
         =(3) Σ_{x ∈ R^d} (e^{−βE(x)} / Z) E(x) =(2) Σ_{x ∈ R^d} P(x) E(x).    (7)

The entropy is defined as:

    H(β) := −Σ_{x ∈ R^d} P(x) ln(P(x))
          =(2) −Σ_{x ∈ R^d} P(x) (−βE(x) − ln(Z))
          = β Σ_{x ∈ R^d} P(x) E(x) + ln(Z) Σ_{x ∈ R^d} P(x)   [where Σ_x P(x) = 1]
          =(a) −β F(β) + β U(β),    (8)

where (a) is because of Eqs. (7) and (5).

Lemma 1. A physical system prefers to be in low energy; hence, the system always loses energy to have less energy.

Proof. On one hand, according to the second law of thermodynamics, the entropy of a physical system always increases with the passing of time (Carroll, 2010). Entropy is a measure of randomness and disorder in a system. On the other hand, when a system loses energy to its surroundings, it becomes less ordered. Hence, as time passes, the energy of the system decreases to have more entropy. Q.E.D.

Corollary 1. According to Eq. (2) and Lemma 1, the probability P(x) of states in a system tends to increase with the passing of time.

This corollary makes sense because systems tend to become more probable. This idea is also used in simulated annealing (Kirkpatrick et al., 1983) where the temperature of the system is cooled down gradually. Simulated annealing and temperature-based learning have been used in BM models (Passos & Papa, 2018; Alberici et al., 2020; 2021).

2.3.2. Ising Model

The Lenz-Ising model (Lenz, 1920; Ising, 1925), also known as the Ising model, is a model in which the particles can have −1 or +1 spins (Brush, 1967). Therefore, x_i ∈ {−1, +1}, ∀i ∈ {1, ..., d}. It uses the Boltzmann distribution, Eq. (2), where the energy function is defined as:

    E(x) := H(x) = −Σ_{(i,j)} J_ij x_i x_j,    (9)

where H(x) is called the Hamiltonian, J_ij ∈ R is the coupling parameter, and the summation is over particles which interact with each other. Note that as energy is proportional to the reciprocal of squared distance, only nearby particles are assumed to be interacting. Therefore, usually the interaction graph of particles is a chain (one-dimensional grid), mesh grid (lattice), closed chain (loop), or torus (multi-dimensional loop).

Based on the characteristics of the model, the coupling parameter has different values. If, for all interacting i and j, we have J_ij ≥ 0 or J_ij < 0, the model is named ferromagnetic and anti-ferromagnetic, respectively. If J_ij can be both positive and negative, the model is called a spin glass. If the coupling parameters are all constant, the model is homogeneous. According to Lemma 1, the energy decreases over time. According to Eq. (9), in a ferromagnetic model (J_ij ≥ 0), the energy of the Ising model decreases if interacting x_i and x_j have the same state (spin) because of the negative sign behind the summation. Likewise, in anti-ferromagnetic models, the nearby particles tend to have different spins over time.

According to Eq. (9), in ferromagnetic models, the energy is zero if J_ij = 0. This results in P(x) = 1 according to Eq. (2). With similar analysis and according to the previous discussion, in ferromagnetic models, J_ij → ∞ yields having the same spins for all particles. This means that we finally have all +1 spins with probability of a half or all −1 spins with probability of a half. Ising models can be modeled as normal factor graphs. For more information on this, refer to (Molkaraie, 2017; 2020). The BM and RBM are Ising models whose coupling parameters are considered as weights, and these weights are learned using maximum likelihood estimation (Hinton, 2007). Hence, we can say that BM and RBM are energy-based learning methods (LeCun et al., 2006).

2.4. Hopfield Network

It was proposed in (Little, 1974) to use the Ising model in a neural network structure. Hopfield extended this idea to model memory by a neural network. The resulting network was the Hopfield network (Hopfield, 1982). This network has some units or neurons denoted by {x_i}_{i=1}^d. The states or outputs of units are all binary, x_i ∈ {−1, +1}, ∀i. Let w_ij denote the weight of the link connecting unit i to unit j. The weights of the Hopfield network are learned using Hebbian learning (Hebb's law of association) (Hebb, 1949):

    w_ij := { x_i × x_j   if i ≠ j,
            { 0           otherwise.    (10)

After training, the outputs of units can be determined for an input if the weighted summation of inputs to a unit passes a threshold θ:

    x_i := { +1   if Σ_{j=1}^d w_ij x_j ≥ θ,
           { −1   otherwise.    (11)

In the original paper of the Hopfield network (Hopfield, 1982), the binary states are x_i ∈ {0, 1}, ∀i, so the Hebbian learning is w_ij := (2x_i − 1) × (2x_j − 1), ∀i ≠ j. The Hopfield network is an Ising model, so it uses Eq. (9) as its energy. This energy is also used in the Boltzmann distribution, Eq. (2). Note that, recently, the Hopfield network with continuous states has been proposed (Ramsauer et al., 2020). The BM and RBM models are Hopfield networks whose weights are learned using maximum likelihood estimation and not Hebbian learning.

3. Restricted Boltzmann Machine

3.1. Structure of Restricted Boltzmann Machine

The Boltzmann Machine (BM) is a generative model and a Probabilistic Graphical Model (PGM) (Bishop, 2006) which is a building block of many probabilistic models. Its name is because of the Boltzmann distribution (Boltzmann, 1868; Gibbs, 1902) used in this model. It was first introduced for use in machine learning in (Hinton & Sejnowski, 1983; Ackley et al., 1985) and then in (Hinton, 2002; Welling et al., 2004). A BM consists of a visible (or observation) layer v = [v_1, ..., v_d]^⊤ ∈ R^d and a hidden layer h = [h_1, ..., h_p]^⊤ ∈ R^p. The visible layer is the layer that we can see; for example, it can be the layer of data. The hidden layer is the layer of latent variables which represent meaningful features or embeddings for the visible data. In other words, there is a meaningful connection between the hidden and visible layers although their dimensionality might differ, i.e., d ≠ p. In the PGM of BM, there are connection links between the elements of v and the elements of h. Each of the elements of v and h also has a bias. There are also links between the elements of v as well as between the elements of h (Salakhutdinov & Hinton, 2009a). Let w_ij denote the link between v_i and h_j, l_ij the link between v_i and v_j, j_ij the link between h_i and h_j, b_i the bias link for v_i, and c_i the bias link for h_i. The dimensionality of these links is W = [w_ij] ∈ R^{d×p}, L = [l_ij] ∈ R^{d×d}, J = [j_ij] ∈ R^{p×p}, b = [b_1, ..., b_d]^⊤ ∈ R^d, and c = [c_1, ..., c_p]^⊤ ∈ R^p. Note that W is a symmetric matrix, i.e., w_ij = w_ji. Also, as there is no link from a node to itself, the diagonal elements of L and J are zero, i.e., l_ii = j_ii = 0, ∀i. The Restricted Boltzmann Machine (RBM) is a BM which does not have links within a layer, i.e., there is no link between the elements of v and no link between the elements of h. In other words, the links are restricted in RBM so that L = J = 0. Figure 1 depicts both BM and RBM with their layers and links. In this section, we focus on RBM. Recall that RBM is an Ising model. As we saw in Eq. (9),

Figure 1. The structures of (a) a Boltzmann machine and (b) a restricted Boltzmann machine.

the energy of an Ising model can be modeled as (Hinton & Sejnowski, 1983; Ackley et al., 1985):

    R ∋ E(v, h) := −b^⊤ v − c^⊤ h − v^⊤ W h,    (12)

which is based on interactions between linked units. As introduced in Eq. (2), the visible and hidden variables make a joint Boltzmann distribution (Hinton, 2012):

    P(v, h) = (1/Z) exp(−E(v, h)) =(12) (1/Z) exp(b^⊤ v + c^⊤ h + v^⊤ W h),    (13)

where Z is the partition function:

    Z := Σ_{v ∈ R^d} Σ_{h ∈ R^p} exp(−E(v, h)).    (14)

According to Lemma 1, the BM and RBM try to reduce the energy of the model. Training the BM or RBM reduces its energy (Hinton & Sejnowski, 1983; Ackley et al., 1985).

3.2. Conditional Distributions

Proposition 1 (Conditional Independence of Variables). In RBM, given the visible variables, the hidden variables are conditionally independent. Likewise, given the hidden variables, the visible variables are conditionally independent. This does not hold in BM because of the links within each layer.

Proof. According to the Bayes rule, we have:

    P(h|v) = P(h, v) / P(v) = P(v, h) / Σ_{h ∈ R^p} P(v, h)
    =(13) [(1/Z) exp(b^⊤ v + c^⊤ h + v^⊤ W h)] / [(1/Z) Σ_{h ∈ R^p} exp(b^⊤ v + c^⊤ h + v^⊤ W h)]
    =(a) [exp(c^⊤ h) exp(v^⊤ W h)] / [Σ_{h ∈ R^p} exp(c^⊤ h) exp(v^⊤ W h)],

where (a) is because the term exp(b^⊤ v) does not contain h and cancels. Note that Σ_{h ∈ R^p} denotes summation over all possible p-dimensional hidden variables for the sake of marginalization. Let Z′ := Σ_{h ∈ R^p} exp(c^⊤ h) exp(v^⊤ W h). Hence:

    P(h|v) = (1/Z′) exp(c^⊤ h + v^⊤ W h)
           = (1/Z′) exp(Σ_{j=1}^p c_j h_j + Σ_{j=1}^p v^⊤ W_:j h_j)
           = (1/Z′) Π_{j=1}^p exp(c_j h_j + v^⊤ W_:j h_j),    (15)

where W_:j ∈ R^d denotes the j-th column of matrix W. Eq. (15) shows that, given the visible variables, the hidden variables are conditionally independent because their joint distribution is the product of every distribution. We can write similar expressions for the probability P(v|h):

    P(v|h) = (1/Z″) exp(b^⊤ v + v^⊤ W h)
           = (1/Z″) exp(Σ_{i=1}^d b_i v_i + Σ_{i=1}^d v_i W_i: h)
           = (1/Z″) Π_{i=1}^d exp(b_i v_i + v_i W_i: h),    (16)

where W_i: ∈ R^p denotes the i-th row of matrix W and Z″ := Σ_{v ∈ R^d} exp(b^⊤ v) exp(v^⊤ W h). This equation shows that, given the hidden variables, the visible variables are conditionally independent. Q.E.D.
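Proposition 1 can be checked numerically by brute force on a tiny RBM with binary units. The sketch below uses illustrative sizes and weights (not from the paper); the per-unit conditional P(h_j = 1 | v) = sigmoid(c_j + v^⊤ W_:j) used for the comparison is the standard closed form for binary units:

```python
import itertools
import math

d, p = 3, 2  # illustrative sizes of visible and hidden layers
# Arbitrary small parameters for the sketch:
W = [[0.5, -0.3], [0.2, 0.8], [-0.6, 0.1]]   # d x p weights
b = [0.1, -0.2, 0.3]                          # visible biases
c = [0.4, -0.1]                               # hidden biases

def energy(v, h):
    """E(v, h) = -b^T v - c^T h - v^T W h, as in Eq. (12)."""
    return -(sum(b[i] * v[i] for i in range(d))
             + sum(c[j] * h[j] for j in range(p))
             + sum(v[i] * W[i][j] * h[j] for i in range(d) for j in range(p)))

def p_h_given_v_exact(h, v):
    """P(h | v) by explicit normalization over all 2^p hidden states."""
    Zv = sum(math.exp(-energy(v, list(hh)))
             for hh in itertools.product([0, 1], repeat=p))
    return math.exp(-energy(v, h)) / Zv

def p_hj_given_v(j, v):
    """Per-unit conditional P(h_j = 1 | v) = sigmoid(c_j + v^T W_:j)."""
    a = c[j] + sum(v[i] * W[i][j] for i in range(d))
    return 1.0 / (1.0 + math.exp(-a))

# Verify Eq. (15): the exact conditional factorizes over hidden units.
v = [1, 0, 1]
for h in itertools.product([0, 1], repeat=p):
    prod = 1.0
    for j in range(p):
        pj = p_hj_given_v(j, v)
        prod *= pj if h[j] == 1 else (1.0 - pj)
    assert abs(p_h_given_v_exact(list(h), v) - prod) < 1e-12
```

The factor exp(b^⊤ v) cancels in the normalization, mirroring step (a) of the proof, which is why the brute-force conditional matches the product of per-unit terms exactly.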

According to Eq. (15) and considering the rule P(h|v) = P(h, v)/P(v), we have:

    P(h|v) = (1/Z′) Π_{j=1}^p exp(c_j h_j + v^⊤ W_:j h_j) = (1/Z′) Π_{j=1}^p P(h_j, v)
    ⟹ P(h_j, v) = exp(c_j h_j + v^⊤ W_:j h_j) = exp(c_j h_j + Σ_{i=1}^d v_i w_ij h_j).    (17)

Similarly, according to Eq. (16) and considering the rule P(v|h) = P(h, v)/P(h), we have:

    P(v|h) = (1/Z″) Π_{i=1}^d exp(b_i v_i + v_i W_i: h) = (1/Z″) Π_{i=1}^d P(h, v_i)
    ⟹ P(h, v_i) = exp(b_i v_i + v_i W_i: h) = exp(b_i v_i + Σ_{j=1}^p v_i w_ij h_j).    (18)

We will use these equations later.

3.3. Sampling Hidden and Visible Variables

3.3.1. Gibbs Sampling

We can use Gibbs sampling for sampling and generating the hidden and visible units. If ν denotes the iteration index of Gibbs sampling, we iteratively sample:

    h^(ν) ∼ P(h | v^(ν)),    (19)
    v^(ν+1) ∼ P(v | h^(ν)),    (20)

until the burn-out convergence. As was explained before, only several iterations of Gibbs sampling are usually sufficient. After the burn-out, the samples are approximately samples from the joint distribution P(v, h). As the variables are conditionally independent, this Gibbs sampling can be implemented as in Algorithm 1. In this algorithm, h_j ∼ P(h_j | v^(ν)) can be implemented as drawing a sample from the uniform distribution, u ∼ U[0, 1], and comparing it to the value of the Probability Density Function (PDF), P(h_j | v^(ν)). If u is less than or equal to this value, we have h_j = 1; otherwise, we have h_j = 0. Implementation of sampling v_i has a similar procedure. Alternatively, we can use the inverse of the Cumulative Distribution Function of these distributions for drawing samples (see (Ghojogh et al., 2020) for more details about sampling).

Algorithm 1: Gibbs sampling in RBM

1 Input: visible dataset v (initialization: optional)
2 Get initialization or do random initialization of v
3 while until burn-out do
4     for j from 1 to p do
5         h_j^(ν) ∼ P(h_j | v^(ν))
6     for i from 1 to d do
7         v_i^(ν+1) ∼ P(v_i | h^(ν))

3.3.2. Generations and Evaluations by Gibbs Sampling

Gibbs sampling for generating both observation and hidden units is used for both the training and evaluation phases of RBM. Use of Gibbs sampling in training RBM will be explained in Sections 3.4 and 3.5. After the RBM model is trained, we can generate any number of p-dimensional hidden variables as a meaningful representation of the d-dimensional observation using Gibbs sampling. Moreover, using Gibbs sampling, we can generate other d-dimensional observations in addition to the original dataset. These newly generated observations are d-dimensional representations for the p-dimensional hidden variables. This shows that BM and RBM are generative models.

3.4. Training Restricted Boltzmann Machine by Maximum Likelihood Estimation

The weights of the links, which are W, b, and c, should be learned so that we can use them for sampling/generating the hidden and visible units. Consider a dataset of n visible vectors {v_i ∈ R^d}_{i=1}^n. Note that the data instance v_i should not be confused with the unit v_i: the former is the i-th visible data instance and the latter is the i-th visible unit. We denote the j-th dimension of v_i by v_{i,j}; in other words, v_i = [v_{i,1}, ..., v_{i,d}]^⊤. The log-likelihood of the visible data is:

    ℓ(W, b, c) = Σ_{i=1}^n log(P(v_i)) = Σ_{i=1}^n log(Σ_{h ∈ R^p} P(v_i, h))
    =(13) Σ_{i=1}^n log(Σ_{h ∈ R^p} (1/Z) exp(−E(v_i, h)))
    = Σ_{i=1}^n [log(Σ_{h ∈ R^p} exp(−E(v_i, h))) − log Z]
    = Σ_{i=1}^n log(Σ_{h ∈ R^p} exp(−E(v_i, h))) − n log Z
    =(14) Σ_{i=1}^n log(Σ_{h ∈ R^p} exp(−E(v_i, h))) − n log(Σ_{v ∈ R^d} Σ_{h ∈ R^p} exp(−E(v, h))).    (21)

We use Maximum Likelihood Estimation (MLE) for finding the parameters θ := {W, b, c}. The derivative of the log-likelihood with respect to parameter θ is:

    ∇_θ ℓ(θ) = ∇_θ Σ_{i=1}^n log(Σ_{h ∈ R^p} exp(−E(v_i, h))) − n ∇_θ log(Σ_{v ∈ R^d} Σ_{h ∈ R^p} exp(−E(v, h))).    (22)

The first term of this derivative is:

    ∇_θ Σ_{i=1}^n log(Σ_{h ∈ R^p} exp(−E(v_i, h)))
    = Σ_{i=1}^n [Σ_{h ∈ R^p} ∇_θ exp(−E(v_i, h))] / [Σ_{h ∈ R^p} exp(−E(v_i, h))]
    = Σ_{i=1}^n [Σ_{h ∈ R^p} exp(−E(v_i, h)) ∇_θ(−E(v_i, h))] / [Σ_{h ∈ R^p} exp(−E(v_i, h))]
    =(a) Σ_{i=1}^n E_{∼P(h|v_i)}[∇_θ(−E(v_i, h))],    (23)

where (a) is because the definition of expectation is E_{∼P}[x] := Σ_i P(x_i) x_i. However, if P is not an actual distribution and does not sum to one, we should normalize it to behave like a distribution in the expectation: E_{∼P}[x] := (Σ_i P(x_i) x_i) / (Σ_i P(x_i)). The second term of the derivative of the log-likelihood is:

    −n ∇_θ log(Σ_{v ∈ R^d} Σ_{h ∈ R^p} exp(−E(v, h)))
    = −n [Σ_{v ∈ R^d} Σ_{h ∈ R^p} ∇_θ exp(−E(v, h))] / [Σ_{v ∈ R^d} Σ_{h ∈ R^p} exp(−E(v, h))]
    = −n [Σ_{v ∈ R^d} Σ_{h ∈ R^p} exp(−E(v, h)) ∇_θ(−E(v, h))] / [Σ_{v ∈ R^d} Σ_{h ∈ R^p} exp(−E(v, h))]
    =(a) −n E_{∼P(h,v)}[∇_θ(−E(v, h))],    (24)

where (a) is for the definition of expectation which was already explained. In summary, the derivative of the log-likelihood is:

    ∇_θ ℓ(θ) = Σ_{i=1}^n E_{∼P(h|v_i)}[∇_θ(−E(v_i, h))] − n E_{∼P(h,v)}[∇_θ(−E(v, h))].    (25)

Setting this derivative to zero does not give us a closed-form solution. Hence, we should learn the parameters iteratively using gradient ascent for MLE.

Now, consider each of the parameters θ = {W, b, c}. The derivatives w.r.t. these parameters in Eq. (25) are:

    ∇_W(−E(v, h)) =(12) (∂/∂W)(b^⊤ v + c^⊤ h + v^⊤ W h) = v h^⊤,
    ∇_b(−E(v, h)) =(12) (∂/∂b)(b^⊤ v + c^⊤ h + v^⊤ W h) = v,
    ∇_c(−E(v, h)) =(12) (∂/∂c)(b^⊤ v + c^⊤ h + v^⊤ W h) = h.

Therefore, Eq. (25) for these parameters becomes:

    ∇_W ℓ(θ) = Σ_{i=1}^n E_{∼P(h|v_i)}[v h^⊤] − n E_{∼P(h,v)}[v h^⊤] = Σ_{i=1}^n v_i E_{∼P(h|v_i)}[h^⊤] − n E_{∼P(h,v)}[v h^⊤],
    ∇_b ℓ(θ) = Σ_{i=1}^n E_{∼P(h|v_i)}[v_i] − n E_{∼P(h,v)}[v] = Σ_{i=1}^n v_i − n E_{∼P(h,v)}[v],
    ∇_c ℓ(θ) = Σ_{i=1}^n E_{∼P(h|v_i)}[h] − n E_{∼P(h,v)}[h].

If we define:

    ĥ_i := E_{∼P(h|v_i)}[h],    (26)

we can summarize these derivatives as:

    R^{d×p} ∋ ∇_W ℓ(θ) = Σ_{i=1}^n v_i ĥ_i^⊤ − n E_{∼P(h,v)}[v h^⊤],    (27)
    R^d ∋ ∇_b ℓ(θ) = Σ_{i=1}^n v_i − n E_{∼P(h,v)}[v],    (28)
    R^p ∋ ∇_c ℓ(θ) = Σ_{i=1}^n ĥ_i − n E_{∼P(h,v)}[h].    (29)

Setting these derivatives to zero does not give a closed-form solution. Hence, we need to find the solution iteratively using gradient ascent where the above gradients are used.

In the derivatives of the log-likelihood, we have two types of expectation. The conditional expectation E_{∼P(h|v_i)}[·] is based on the observation or data, which is v_i. The joint expectation E_{∼P(h,v)}[·], however, has nothing to do with the observation and is merely about the RBM model.

3.5. Contrastive Divergence

According to Eq. (23), the conditional expectation used in Eq. (26) includes one summation. Moreover, according to Eq. (24), the joint expectations used in Eqs. (27), (28), and (29) contain two summations. This double summation makes computation of the joint expectation intractable because it sums over all possible values for both hidden and visible units. Therefore, exact computation of MLE is hard and we should approximate it. One way to approximate computation of the joint expectations in MLE is contrastive divergence (Hinton, 2002). Contrastive divergence improves the efficiency and reduces the variance of estimation in RBM (Hinton, 2002; Welling et al., 2004).

The idea of contrastive divergence is as follows. First, we obtain a point ṽ using Gibbs sampling starting from the observation v (see Section 3.3 for Gibbs sampling in RBM). Then, we compute the expectation by using only that one point ṽ. The intuitive reason why contrastive divergence works is explained in the following. We need to minimize the gradients to find the solution of MLE. In the joint expectations in Eqs. (27), (28), and (29), rather than considering all possible values of observations, contrastive divergence considers only one of the data points (observations). If this observation is a wrong belief which we do not wish to see in the generation of observations by RBM, contrastive divergence is performing a task which is called negative sampling (Hinton, 2012). In negative sampling, rather than training the model to not generate all wrong observations at once, we train it iteratively but less ambitiously in every iteration. Each iteration tries to teach the model to not generate only one of the wrong outputs. Gradually, the model learns to generate correct observations by avoiding generating these negative samples.

Let h̃ = [h̃_1, ..., h̃_p]^⊤ be the sampled h corresponding to ṽ = [ṽ_1, ..., ṽ_d]^⊤ in Gibbs sampling. According to the above explanations, contrastive divergence approximates the joint expectation in the derivative of the log-likelihood, Eq. (25), by Monte Carlo approximation (Ghojogh et al., 2020) evaluated at ṽ_i and h̃_i for the i-th observation and hidden units, where ṽ_i and h̃_i are found by Gibbs sampling. Hence:

    E_{∼P(h,v)}[∇_θ(−E(v, h))] ≈ (1/n) Σ_{i=1}^n ∇_θ(−E(v_i, h_i)) |_{v_i = ṽ_i, h_i = h̃_i}.    (30)

Experiments have shown that a small number of iterations in Gibbs sampling suffices for contrastive divergence. The paper (Hinton, 2002) even uses one iteration of Gibbs sampling for this task. This small number of required iterations has the support of literature because Gibbs sampling is a special case of Metropolis-Hastings algorithms (Ghojogh et al., 2020), which are fast (Dwivedi et al., 2018).

By the approximation in Eq. (30), Eqs. (27), (28), and (29) become:

    ∇_W ℓ(θ) = Σ_{i=1}^n v_i ĥ_i^⊤ − Σ_{i=1}^n ṽ_i h̃_i^⊤,    (31)
    ∇_b ℓ(θ) = Σ_{i=1}^n v_i − Σ_{i=1}^n ṽ_i,    (32)
    ∇_c ℓ(θ) = Σ_{i=1}^n ĥ_i − Σ_{i=1}^n h̃_i.    (33)

These equations make sense because, when the observation and the hidden variable given the observation become equal to the approximation by Gibbs sampling, the gradient becomes zero and the training stops. Note that some works in the literature restate Eqs. (31), (32), and (33) as (Hinton, 2002; 2012; Taylor et al., 2007):

    ∀i, j: ∇_{w_ij} ℓ(θ) = ⟨v_i h_j⟩_data − ⟨v_i h_j⟩_recon.,    (34)
    ∀i: ∇_{b_i} ℓ(θ) = ⟨v_i⟩_data − ⟨v_i⟩_recon.,    (35)
    ∀j: ∇_{c_j} ℓ(θ) = ⟨h_j⟩_data − ⟨h_j⟩_recon.,    (36)

where ⟨·⟩_data and ⟨·⟩_recon. denote expectation over the data and the reconstruction of data, respectively.

Algorithm 2: Training RBM using contrastive divergence

1 Input: training data {x_i}_{i=1}^n
2 Randomly initialize W, b, c
3 while not converged do
4     Sample a mini-batch {v_1, ..., v_m} from the training dataset {x_i}_{i=1}^n (n.b. we may set m = n)
5     // Gibbs sampling for each data point:
6     Initialize v̂_i^(0) ← v_i for all i ∈ {1, ..., m}
7     for i from 1 to m do
8         Run Algorithm 1 initialized with v̂_i^(0)
9         {h_j}_{j=1}^p, {v_i}_{i=1}^d ← last iteration of Algorithm 1
10        h̃_i ← [h_1, ..., h_p]^⊤
11        ṽ_i ← [v_1, ..., v_d]^⊤
12        ĥ_i ← E_{∼P(h|v_i)}[h]
13    // gradients:
14    ∇_W ℓ(θ) ← Σ_{i=1}^m v_i ĥ_i^⊤ − Σ_{i=1}^m ṽ_i h̃_i^⊤
15    ∇_b ℓ(θ) ← Σ_{i=1}^m v_i − Σ_{i=1}^m ṽ_i
16    ∇_c ℓ(θ) ← Σ_{i=1}^m ĥ_i − Σ_{i=1}^m h̃_i
17    // gradient step for updating the solution:
18    W ← W + η ∇_W ℓ(θ)
19    b ← b + η ∇_b ℓ(θ)
20    c ← c + η ∇_c ℓ(θ)
21 Return W, b, c
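A minimal CD-1 training loop in the spirit of Algorithm 2 can be sketched as follows. Everything here is an illustrative assumption (toy data, sizes, learning rate, helper names); binary units with sigmoid conditionals are assumed, one Gibbs step is used as in (Hinton, 2002), hidden probabilities rather than samples are used in the updates (a common practical choice), and the gradient terms of Eqs. (31)-(33) are added since we ascend the log-likelihood:

```python
import math
import random

random.seed(0)
d, p, eta = 4, 3, 0.1  # illustrative layer sizes and learning rate

W = [[random.gauss(0, 0.01) for _ in range(p)] for _ in range(d)]
b = [0.0] * d  # visible biases
c = [0.0] * p  # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def h_probs(v):
    """P(h_j = 1 | v) for binary units."""
    return [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(d))) for j in range(p)]

def v_probs(h):
    """P(v_i = 1 | h) for binary units."""
    return [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(p))) for i in range(d)]

def sample(probs):
    """Threshold a uniform draw against each probability, as in Algorithm 1."""
    return [1 if random.random() < q else 0 for q in probs]

data = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1]]  # toy visible dataset

for epoch in range(100):
    for v0 in data:
        ph0 = h_probs(v0)          # positive phase: h-hat = E[h | v]
        h0 = sample(ph0)
        v1 = sample(v_probs(h0))   # one Gibbs step gives v-tilde
        ph1 = h_probs(v1)          # h-tilde (probabilities used here)
        for i in range(d):
            for j in range(p):
                # Eq. (31)-style update: data term minus reconstruction term
                W[i][j] += eta * (v0[i] * ph0[j] - v1[i] * ph1[j])
        for i in range(d):
            b[i] += eta * (v0[i] - v1[i])       # Eq. (32)
        for j in range(p):
            c[j] += eta * (ph0[j] - ph1[j])     # Eq. (33)
```

After training, reconstructions of the training vectors (one up-down pass through h_probs and v_probs) tend to resemble the data, which is the qualitative signature of CD having fit the toy dataset.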

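The gradient estimates of Eqs. (31)-(33) can be plugged into a mini-batch training loop in the style of Algorithm 2. The sketch below assumes binary units and one Gibbs step per update; the helper name `train_rbm` and the hyperparameter defaults are illustrative, not from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(X, p, eta=0.05, batch_size=32, epochs=10, seed=0):
    """Sketch of Algorithm 2: mini-batch training of a binary RBM
    with one-step contrastive divergence."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = 0.01 * rng.normal(size=(d, p))   # weights
    b = np.zeros(d)                      # visible biases
    c = np.zeros(p)                      # hidden biases
    for _ in range(epochs):
        for start in range(0, n, batch_size):
            V = X[start:start + batch_size]
            m = len(V)
            # Gibbs sampling (Algorithm 1, one step):
            H_hat = sigmoid(c + V @ W)
            H = (rng.random(H_hat.shape) < H_hat).astype(float)
            V_t = (rng.random(V.shape) < sigmoid(b + H @ W.T)).astype(float)
            H_t = sigmoid(c + V_t @ W)
            # Gradient ascent on the log-likelihood, Eqs. (31)-(33):
            W += eta / m * (V.T @ H_hat - V_t.T @ H_t)
            b += eta / m * (V - V_t).sum(axis=0)
            c += eta / m * (H_hat - H_t).sum(axis=0)
    return W, b, c
```

Setting `batch_size = n` recovers full-batch gradient descent, as mentioned in the text.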
The training algorithm of RBM, using contrastive divergence, can be found in Algorithm 2. In this algorithm, we use mini-batch gradient descent with batch size m. If the training dataset is not large, one can set m = n to have (full-batch) gradient descent. This algorithm iterates until convergence where, in every iteration, a mini-batch is sampled in which we have an observation v_i ∈ R^d and a hidden variable h_i ∈ R^p for every i-th training data point. For every data point, we apply Gibbs sampling as shown in Algorithm 1. After Gibbs sampling, the gradients are calculated by Eqs. (31), (32), and (33), and then the variables are updated using a gradient descent step.

3.6. Boltzmann Machine

So far, we introduced and explained RBM. BM has more links compared to RBM (Salakhutdinov & Hinton, 2009a). Here, we briefly introduce the training of BM. Its structure is depicted in Fig. 1. As was explained in Section 3.1, BM has additional links L = [l_{ij}] ∈ R^{d×d} and J = [j_{ij}] ∈ R^{p×p}. The weights W ∈ R^{d×p} and the biases b ∈ R^d and c ∈ R^p are trained by gradient descent using the gradients in Eqs. (31), (32), and (33). The additional weights L and J are updated similarly using the following gradients (Salakhutdinov & Hinton, 2009a):

\nabla_L \ell(\theta) = \sum_{i=1}^n v_i v_i^\top - \sum_{i=1}^n \tilde{v}_i \tilde{v}_i^\top,   (37)
\nabla_J \ell(\theta) = \sum_{i=1}^n \mathbb{E}_{\sim P(h|v_i)}[h h^\top] - \sum_{i=1}^n \tilde{h}_i \tilde{h}_i^\top.   (38)

These equations can be restated as:

\forall i, j: \nabla_{l_{ij}} \ell(\theta) = \langle v_i v_j \rangle_{\text{data}} - \langle v_i v_j \rangle_{\text{recon.}},   (39)
\forall i, j: \nabla_{j_{ij}} \ell(\theta) = \langle h_i h_j \rangle_{\text{data}} - \langle h_i h_j \rangle_{\text{recon.}},   (40)

where ⟨·⟩_data and ⟨·⟩_recon. denote the expectation over the data and over the reconstruction of the data, respectively.

4. Distributions of Visible and Hidden Variables

4.1. Modeling with Exponential Family Distributions

According to Proposition 1, the units v ∈ R^d and h ∈ R^p have conditional independence, so their distribution is the product of the individual conditional distributions. We can choose distributions from the exponential family for the visible and hidden variables (Welling et al., 2004):

P(v) = \prod_{i=1}^d r_i(v_i) \exp\Big( \sum_a \theta_{ia} f_{ia}(v_i) - A_i(\{\theta_{ia}\}) \Big),   (41)
P(h) = \prod_{j=1}^p s_j(h_j) \exp\Big( \sum_b \lambda_{jb} g_{jb}(h_j) - B_j(\{\lambda_{jb}\}) \Big),   (42)

where {f_{ia}(v_i), g_{jb}(h_j)} are the sufficient statistics, {θ_i, λ_j} are the canonical parameters of the models, {A_i, B_j} are the log-normalization factors, and {r_i(v_i), s_j(h_j)} are the normalization factors, which are some additional features multiplied by some constants. We can ignore the log-normalization factors because they are hard to compute.

For the joint distribution of visible and hidden variables, we should introduce a quadratic term for their cross-interaction (Welling et al., 2004):

P(v, h) \propto \exp\Big( \sum_{i=1}^d \sum_a \theta_{ia} f_{ia}(v_i) + \sum_{j=1}^p \sum_b \lambda_{jb} g_{jb}(h_j) + \sum_{i=1}^d \sum_{j=1}^p \sum_a \sum_b W_{ia}^{jb} f_{ia}(v_i)\, g_{jb}(h_j) \Big).   (43)

According to Proposition 1, the visible and hidden units have conditional independence. Therefore, the conditional distributions can be written as a product of exponential family distributions (Welling et al., 2004):

P(v|h) = \prod_{i=1}^d \exp\Big( \sum_a \hat{\theta}_{ia} f_{ia}(v_i) - A_i(\{\hat{\theta}_{ia}\}) \Big),   (44)
P(h|v) = \prod_{j=1}^p \exp\Big( \sum_b \hat{\lambda}_{jb} g_{jb}(h_j) - B_j(\{\hat{\lambda}_{jb}\}) \Big),   (45)

where:

\hat{\theta}_{ia} := \theta_{ia} + \sum_{j=1}^p \sum_b W_{ia}^{jb} g_{jb}(h_j),   (46)
\hat{\lambda}_{jb} := \lambda_{jb} + \sum_{i=1}^d \sum_a W_{ia}^{jb} f_{ia}(v_i).   (47)

Therefore, we can choose one of the distributions in the exponential family for the conditional distributions of the visible and hidden variables. In the following, we introduce different cases where the units can have either discrete or continuous values. In all cases, the distributions are from exponential families.

4.2. Binary States

The hidden and visible variables can have a discrete number of values, also called states. Most often, inspired by the Hopfield network, BM and RBM have binary states. In this case, the hidden and visible units can have binary states, i.e., v_i, h_j ∈ {0, 1}, ∀i, j. Hence, we can say:

P(h_j = 1 | v) = \frac{P(h_j = 1, v)}{P(h_j = 0, v) + P(h_j = 1, v)}.   (48)
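Eq. (48) can be checked numerically: the ratio of the unnormalized joint probabilities of the two states of h_j reduces to a sigmoid of the hidden activation, as derived next in Eq. (49). This is a small illustrative sketch; the helper name `p_hidden_on` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_hidden_on(v, W, c, j):
    """P(h_j = 1 | v) for a binary RBM via Eq. (48): the ratio of the
    unnormalized joint probabilities of the two states of h_j."""
    activation = c[j] + v @ W[:, j]
    p_off = np.exp(0.0)        # state h_j = 0 contributes exp(0) = 1
    p_on = np.exp(activation)  # state h_j = 1 contributes exp(c_j + v^T W_:j)
    return p_on / (p_off + p_on)

rng = np.random.default_rng(0)
v = (rng.random(5) < 0.5).astype(float)
W = rng.normal(size=(5, 3))
c = rng.normal(size=3)
# Eq. (48) agrees with the sigmoid form of Eq. (49):
assert np.isclose(p_hidden_on(v, W, c, 1), sigmoid(c[1] + v @ W[:, 1]))
```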

In binary states, the joint probability in Eq. (17) is simplified to:

P(h_j = 0, v) = \exp(c_j \times 0 + v^\top W_{:j} \times 0) = \exp(0) = 1,
P(h_j = 1, v) = \exp(c_j \times 1 + v^\top W_{:j} \times 1) = \exp(c_j + v^\top W_{:j}).

Hence, Eq. (48) becomes:

P(h_j = 1 | v) = \frac{\exp(c_j + v^\top W_{:j})}{1 + \exp(c_j + v^\top W_{:j})} = \frac{1}{1 + \exp(-(c_j + v^\top W_{:j}))} = \sigma(c_j + v^\top W_{:j}),   (49)

where:

\sigma(x) := \frac{\exp(x)}{1 + \exp(x)} = \frac{1}{1 + \exp(-x)},

is the sigmoid (or logistic) function and W_{:j} ∈ R^d denotes the j-th column of the matrix W. If the visible units also have binary states, we similarly have:

P(v_i = 1 | h) = \sigma(b_i + W_{i:} h),   (50)

where W_{i:} ∈ R^p denotes the i-th row of the matrix W. As we have only two states {0, 1}, from Eqs. (49) and (50), we have:

P(h_j | v) = \sigma(c_j + v^\top W_{:j}) = \sigma\Big(c_j + \sum_{i=1}^d v_i w_{ij}\Big),   (51)
P(v_i | h) = \sigma(b_i + W_{i:} h) = \sigma\Big(b_i + \sum_{j=1}^p w_{ij} h_j\Big).   (52)

According to Proposition 1, the units have conditional independence, so their distribution is the product of the individual conditional distributions:

P(h | v) = \prod_{j=1}^p P(h_j | v) = \prod_{j=1}^p \sigma(c_j + v^\top W_{:j}),   (53)
P(v | h) = \prod_{i=1}^d P(v_i | h) = \prod_{i=1}^d \sigma(b_i + W_{i:} h).   (54)

Therefore, in the Gibbs sampling of Algorithm 1, we sample from the distributions of Eqs. (51) and (52). Note that the sigmoid function is between zero and one, so we can use the uniform distribution u ∼ U[0, 1] for sampling from it; this was explained in Section 3.3. Moreover, for binary states, we have \mathbb{E}_{\sim P(h_j|v)}[h_j] = \sigma(c_j + v^\top W_{:j}). Hence, if we apply the sigmoid function element-wise on the elements of \hat{h}_i ∈ R^p, we have, for Eq. (26) in training a binary-state RBM:

\hat{h}_i = \mathbb{E}_{\sim P(h|v_i)}[h] = \sigma(c + v_i^\top W).   (55)

This equation is also used in Algorithm 2.

4.3. Continuous Values

In some cases, we set the hidden units to have continuous values as continuous representations for the visible units. According to the definition of conditional probability, we have:

P(v_i = 1 | h) = \frac{P(v_i = 1, h)}{\sum_{v_i} P(h, v_i)}.   (56)

According to Eq. (18), we have:

P(h, v_i = 1) = \exp(b_i \times 1 + 1 \times W_{i:} h) = \exp(b_i + W_{i:} h).   (57)

Hence, Eq. (56) becomes:

P(v_i = 1 | h) = \frac{\exp(b_i + W_{i:} h)}{\sum_{v_i} P(h, v_i)}.   (58)

This is a softmax function, which can approximate a Gaussian (normal) distribution, and it sums to one. Therefore, we can write it as a normal distribution with variance one:

P(v_i | h) = \mathcal{N}(b_i + W_{i:} h, 1) = \mathcal{N}\Big(b_i + \sum_{j=1}^p w_{ij} h_j, 1\Big).   (59)

According to Proposition 1, the units have conditional independence, so their distribution is the product of the individual conditional distributions:

P(v | h) = \prod_{i=1}^d P(v_i | h) = \prod_{i=1}^d \mathcal{N}(b_i + W_{i:} h, 1).   (60)

Usually, when the hidden units have continuous values, the visible units have binary states (Welling et al., 2004; Mohamed & Hinton, 2010). In this case, the conditional distribution P(h|v) is obtained by Eq. (53). If the hidden units have continuous values, their distribution can be similarly calculated as:

P(h | v) = \prod_{j=1}^p P(h_j | v) = \prod_{j=1}^p \mathcal{N}(c_j + v^\top W_{:j}, 1).   (61)

These normal distributions can be used for sampling in the Gibbs sampling of Algorithm 1. In this case, Eq. (26), used in Algorithm 2 for training RBM, is:

\hat{h}_i = \mathbb{E}_{\sim P(h|v_i)}[h] = \mathcal{N}(c + v_i^\top W, I_{p \times p}),   (62)

where I denotes the identity matrix.

4.4. Discrete Poisson States

In some cases, the units have discrete states, as discussed in Section 4.2, but with more than two values. In this case, we can use the well-known Poisson distribution for discrete random variables:

\text{Ps}(t, \lambda) = \frac{e^{-\lambda} \lambda^t}{t!}.

Assume every visible unit can have a value t ∈ {0, 1, 2, 3, ...}. If we consider the conditional Poisson distribution for the visible units, we have (Salakhutdinov & Hinton, 2009b):

P(v_i = t | h) = \text{Ps}\Big(t, \frac{\exp(b_i + W_{i:} h)}{\sum_{k=1}^d \exp(b_k + W_{k:} h)}\Big).   (63)

Similarly, if we have discrete states for the hidden units, we can have:

P(h_j = t | v) = \text{Ps}\Big(t, \frac{\exp(c_j + v^\top W_{:j})}{\sum_{k=1}^p \exp(c_k + v^\top W_{:k})}\Big).   (64)

These Poisson distributions can be used for sampling in the Gibbs sampling of Algorithm 1. In this case, Eq. (26), used in Algorithm 2 for training RBM, can be calculated using a multivariate Poisson distribution (Edwards, 1962). It is noteworthy that RBM has been used for semantic hashing, where the hidden variables are used as a hashing representation of data (Salakhutdinov & Hinton, 2009b). Semantic hashing uses the Poisson distribution and the sigmoid function for the conditional visible and hidden variables, respectively.

5. Conditional Restricted Boltzmann Machine

If data are a time-series, RBM does not include its temporal (time) information. In other words, RBM is suitable for static data. Conditional RBM (CRBM), proposed in (Taylor et al., 2007), incorporates the temporal information into the configuration of RBM. It considers the visible variables of previous time steps as conditional variables. CRBM adds two sets of directed links to RBM. The first set of directed links consists of the autoregressive links from the past T_1 visible units to the visible units of the current time step. The second set of directed links consists of the links from the past T_2 visible units to the hidden units of the current time step. In general, T_1 is not necessarily equal to T_2 but, for simplicity, we usually set T_1 = T_2 = T (Taylor et al., 2007). We denote the links from the visible units at time t − τ to the visible units at the current time by G^{(t−τ)} = [g_{ij}] ∈ R^{d×d}. We also denote the links from the visible units at time t − τ to the hidden units at the current time by Q^{(t−τ)} = [q_{ij}] ∈ R^{d×p}. The structure of CRBM is shown in Fig. 2. Note that each arrow in this figure is a set of links representing a matrix or vector of weights.

The updating rules for the weights W and the biases b and c are the same as in Eqs. (31), (32), and (33). We consider the directed links from the previous visible units to the current visible units and current hidden units as dynamically changing biases. Recall that, for updating the biases b and c, we used Eqs. (32) and (33). Similarly, for updating the added links from τ previous time steps, we have:

\forall i: \mathbb{R}^d \ni \nabla_{G_{i:}^{(t-\tau)}} \ell(\theta) = v_i^{(t-\tau)} \Big( \sum_{k=1}^n v_k^{(t)} - \sum_{k=1}^n \tilde{v}_k^{(t)} \Big),   (65)
\forall i: \mathbb{R}^p \ni \nabla_{Q_{i:}^{(t-\tau)}} \ell(\theta) = v_i^{(t-\tau)} \Big( \sum_{k=1}^n \hat{h}_k - \sum_{k=1}^n \tilde{h}_k \Big),   (66)

where i ∈ {1, ..., d}, τ ∈ {1, 2, ..., T}, G_{i:}^{(t−τ)} denotes the i-th row of G^{(t−τ)}, and Q_{i:}^{(t−τ)} denotes the i-th row of Q^{(t−τ)}. These equations are similar to Eqs. (32) and (33), but they are multiplied by the visible values in previous time steps. This is because RBM multiplies the biases by one (see Fig. 1), while these newly introduced biases have the previous visible values instead of one (see Fig. 2). These equations can be restated as (Taylor et al., 2007):

\forall i, j: \nabla_{g_{ij}^{(t-\tau)}} \ell(\theta) = v_i^{(t-\tau)} \big( \langle v_i^{(t)} \rangle_{\text{data}} - \langle v_i^{(t)} \rangle_{\text{recon.}} \big),   (67)
\forall i, j: \nabla_{q_{ij}^{(t-\tau)}} \ell(\theta) = v_i^{(t-\tau)} \big( \langle h_j^{(t)} \rangle_{\text{data}} - \langle h_j^{(t)} \rangle_{\text{recon.}} \big),   (68)

where ⟨·⟩_data and ⟨·⟩_recon. denote the expectation over the data and over the reconstruction of the data, respectively. In addition to the weights and biases of RBM, the additional links are learned by gradient descent using the above gradients. Algorithm 2 can be used for training CRBM if learning of the added links is also included in the algorithm.

Interpolating CRBM (ICRBM) (Mohamed & Hinton, 2010) is an improvement over the CRBM, where some links have been added from visible variables in the future. Figure 2 depicts the structure of ICRBM. Its training and formulation are similar, but we do not cover its theory in this paper for the sake of brevity. Note that CRBM has been used in various time-series applications such as action recognition (Taylor et al., 2007) and acoustics (Mohamed & Hinton, 2010).

6. Deep Belief Network

6.1. Stacking RBM Models

We can train a neural network using RBM training (Hinton & Salakhutdinov, 2006; Hinton et al., 2006). Training a neural network by RBM training can result in a very good initialization of the weights for training the network with backpropagation. Before the development of ReLU (Glorot et al., 2011) and dropout (Srivastava et al., 2014), multilayer networks could not become deep because of the problem of vanishing gradients. This was because random initial weights were not suitable enough for starting the optimization in backpropagation, especially in deep networks.
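The dynamically changing biases of CRBM (Section 5) can be sketched as follows: the past visible frames shift the static biases instead of multiplying them by one. This is an illustrative helper under assumed array shapes, with the function name `crbm_dynamic_biases` hypothetical.

```python
import numpy as np

def crbm_dynamic_biases(b, c, G, Q, v_past):
    """Effective (dynamically changing) biases of a CRBM.

    b : (d,) static visible bias;  c : (p,) static hidden bias
    G : (T, d, d) links from v^(t-tau) to the current visible units
    Q : (T, d, p) links from v^(t-tau) to the current hidden units
    v_past : (T, d) past visible frames v^(t-1), ..., v^(t-T)
    """
    # b_eff[j] = b[j] + sum_{tau, i} G[tau, i, j] * v_past[tau, i]
    b_eff = b + np.einsum('tij,ti->j', G, v_past)
    # c_eff[j] = c[j] + sum_{tau, i} Q[tau, i, j] * v_past[tau, i]
    c_eff = c + np.einsum('tij,ti->j', Q, v_past)
    return b_eff, c_eff
```

With these effective biases in place of b and c, the conditional distributions and the contrastive-divergence updates of the RBM carry over to the CRBM.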

Figure 2. The structures of (a) a conditional restricted Boltzmann machine and (b) an interpolating conditional restricted Boltzmann machine. Note that each arrow in this figure represents a set of links.

Therefore, a method was proposed for pre-training neural networks, which initializes the network to a suitable set of weights; the pre-trained weights are then fine-tuned using backpropagation (Hinton & Salakhutdinov, 2006; Hinton et al., 2006).

A neural network consists of several layers. Let ℓ denote the number of layers, where the first layer gets the input data, and let p_l be the number of neurons in the l-th layer. By convention, we have p_1 = d. We can consider every two successive layers as one RBM. This is shown in Fig. 3. We start from the first pair of layers as an RBM, and we introduce the training dataset {x_i ∈ R^d}_{i=1}^n as the visible variables {v_i}_{i=1}^n of the first pair of layers. We train the weights and biases of this first RBM using Algorithm 2. After training this RBM, we generate n p_2-dimensional hidden variables using Gibbs sampling in Algorithm 1. Now, we consider the hidden variables of the first RBM as the visible variables for the second RBM (the second pair of layers). Again, this RBM is trained by Algorithm 2 and, then, hidden variables are generated using Gibbs sampling in Algorithm 1. This procedure is repeated until all pairs of layers are trained using RBM training. This layer-wise training of the neural network is a greedy approach (Bengio et al., 2007), and it prepares well-initialized weights and biases for the whole neural network. Now, we fine-tune the weights and biases using backpropagation (Rumelhart et al., 1986).

The explained training algorithm was first proposed in (Hinton & Salakhutdinov, 2006; Hinton et al., 2006) and was used for dimensionality reduction. By increasing ℓ to any large number, the network becomes large and deep. As the layers are trained one by one as RBM models, we can make the network as deep as we want without being worried about vanishing gradients, because the weights are initialized well for backpropagation. As this network can get deep and is pre-trained by belief propagation (RBM training), it is referred to as the Deep Belief Network (DBN) (Hinton et al., 2006; Hinton, 2009). DBN can be seen as a stack of RBM models. The pre-training of a DBN using RBM training is depicted in Fig. 3. This algorithm is summarized in Algorithm 3. In this algorithm, W_l ∈ R^{p_l × p_{l+1}} denotes the weights connecting layer l to layer (l+1), and b_l ∈ R^{p_l} denotes the biases of layer l. Note that, as the weights are between every two layers, the sets of weights are {W_l}_{l=1}^{ℓ−1}.

Figure 3. Pre-training a deep belief network by considering every pair of layers as an RBM.

1   Input: training data {x_i}_{i=1}^n
2   // pre-training:
3   for l from 1 to ℓ − 1 do
4       if l = 1 then
5           {v_i}_{i=1}^n ← {x_i}_{i=1}^n
6       else
7           // generate n hidden variables of previous RBM:
8           {h_i}_{i=1}^n ← Algorithm 1 for (l−1)-th RBM ← {v_i}_{i=1}^n
9           {v_i}_{i=1}^n ← {h_i}_{i=1}^n
10      W_l, b_l, b_{l+1} ← Algorithm 2 for l-th RBM ← {v_i}_{i=1}^n
11  // fine-tuning using backpropagation:
12  Initialize the network with weights {W_l}_{l=1}^{ℓ−1} and biases {b_l}_{l=2}^{ℓ}.
13  {W_l}_{l=1}^{ℓ−1}, {b_l}_{l=1}^{ℓ} ← Backpropagate the error of the loss for several epochs.

Algorithm 3: Training a deep belief network

Note that the pre-training of DBN is an unsupervised task because RBM training is unsupervised. Fine-tuning of DBN can be either unsupervised or supervised depending on the loss function for backpropagation. If the DBN is an autoencoder with a low-dimensional middle layer in the network, both its pre-training and fine-tuning stages are unsupervised, because the loss function of backpropagation is the mean squared reconstruction error, which requires no labels. This DBN can learn a low-dimensional embedding or representation of data and can be used for dimensionality reduction (Hinton & Salakhutdinov, 2006). The DBN autoencoder has also been used for hashing (Salakhutdinov & Hinton, 2009b). The structure of this network is depicted in Fig. 4.
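The greedy pre-training loop of Algorithm 3 can be sketched as follows, assuming binary units throughout. The helpers `train_rbm` and `pretrain_dbn` are hypothetical compressed versions of Algorithms 2 and 3, not the paper's exact procedure.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(V, p, eta=0.05, epochs=5, rng=None):
    """CD-1 training of one binary RBM (compressed Algorithm 2)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, d = V.shape
    W, b, c = 0.01 * rng.normal(size=(d, p)), np.zeros(d), np.zeros(p)
    for _ in range(epochs):
        H_hat = sigmoid(c + V @ W)
        H = (rng.random(H_hat.shape) < H_hat).astype(float)
        V_t = (rng.random(V.shape) < sigmoid(b + H @ W.T)).astype(float)
        H_t = sigmoid(c + V_t @ W)
        W += eta / n * (V.T @ H_hat - V_t.T @ H_t)
        b += eta / n * (V - V_t).sum(axis=0)
        c += eta / n * (H_hat - H_t).sum(axis=0)
    return W, b, c

def pretrain_dbn(X, layer_sizes, rng=None):
    """Greedy layer-wise pre-training (Algorithm 3): train an RBM per pair
    of layers, then feed its sampled hidden variables to the next RBM."""
    rng = rng if rng is not None else np.random.default_rng(0)
    V, params = X, []
    for p in layer_sizes:          # p_2, ..., p_ell
        W, b, c = train_rbm(V, p, rng=rng)
        params.append((W, b, c))
        P_h = sigmoid(c + V @ W)   # hidden variables of this RBM
        V = (rng.random(P_h.shape) < P_h).astype(float)
    return params                  # initialization for backpropagation
```

The returned weights and biases would then initialize the network before fine-tuning with backpropagation.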

6.2. Other Improvements over RBM and DBN

Some improvements over DBN are the convolutional DBN (Krizhevsky & Hinton, 2010) and the use of DBN for hashing (Salakhutdinov & Hinton, 2009b). Greedy training of DBN using RBM training has been used in training the t-SNE network for a general degree of freedom (Van Der Maaten, 2009). In addition to CRBM (Taylor et al., 2007), recurrent RBM (Sutskever et al., 2009) has been proposed to handle the temporal information of data. Also, note that there exist some other energy-based models in addition to BM; the Helmholtz machine (Dayan et al., 1995) is an example. There also exists the Deep Boltzmann Machine (DBM), which is slightly different from DBN. For the sake of brevity, we do not cover it here and refer the interested reader to (Salakhutdinov & Hinton, 2009a). Various efficient training algorithms have been proposed for DBM (Salakhutdinov & Larochelle, 2010; Salakhutdinov, 2010; Salakhutdinov & Hinton, 2012; Srivastava & Salakhutdinov, 2012; Hinton & Salakhutdinov, 2012; Montavon & Müller, 2012; Goodfellow et al., 2013; Srivastava & Salakhutdinov, 2014; Melchior et al., 2016). Some of the applications of DBM are document processing (Srivastava et al., 2013) and face modeling (Nhan Duong et al., 2015).

Figure 4. A DBN autoencoder where the number of neurons in the corresponding layers of the encoder and decoder are usually set to be equal. The coder layer is a low-dimensional embedding for representation of data.

7. Conclusion

This was a tutorial paper on BM, RBM, and DBN. After some background, we covered the structure of BM and RBM, Gibbs sampling in RBM for generating visible and hidden variables, training RBM using contrastive divergence, and training BM. Then, we introduced various cases for the states of visible and hidden units. Thereafter, CRBM and DBN were explained in detail.

Acknowledgement

The authors hugely thank Prof. Mehdi Molkaraie for his course, which partly covered some materials on the Ising model and statistical physics. Some of the materials in this tutorial paper have been covered by Prof. Ali Ghodsi's videos on YouTube.

References

Ackley, David H, Hinton, Geoffrey E, and Sejnowski, Terrence J. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.

Alberici, Diego, Barra, Adriano, Contucci, Pierluigi, and Mingione, Emanuele. Annealing and replica-symmetry in deep Boltzmann machines. Journal of Statistical Physics, 180(1):665–677, 2020.

Alberici, Diego, Contucci, Pierluigi, and Mingione, Emanuele. Deep Boltzmann machines: rigorous results at arbitrary depth. In Annales Henri Poincaré, pp. 1–24. Springer, 2021.

Bengio, Yoshua, Lamblin, Pascal, Popovici, Dan, and Larochelle, Hugo. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, pp. 153–160, 2007.

Edwards, Carol Bates. Multivariate and multiple Poisson distributions. Iowa State University, 1962.

Geman, Stuart and Geman, Donald. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(6):721–741, 1984.

Ghojogh, Benyamin and Crowley, Mark. The theory behind overfitting, cross validation, regularization, bagging, and boosting: tutorial. arXiv preprint arXiv:1905.12787, 2019.

Ghojogh, Benyamin, Nekoei, Hadi, Ghojogh, Aydin, Karray, Fakhri, and Crowley, Mark. Sampling algorithms, from survey sampling to Monte Carlo methods: Tutorial and literature review. arXiv preprint arXiv:2011.00901, 2020.

Gibbs, J Willard. Elementary Principles in Statistical Mechanics. Courier Corporation, 1902.

Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323. JMLR Workshop and Conference Proceedings, 2011.

Goodfellow, Ian, Mirza, Mehdi, Courville, Aaron, and Bengio, Yoshua. Multi-prediction deep Boltzmann machines. Advances in Neural Information Processing Systems, 26:548–556, 2013.

Bishop, Christopher M. Pattern recognition. Machine Learning, 128(9), 2006.

Hebb, Donald. The Organization of Behavior. Wiley & Sons, New York, 1949.

Boltzmann, Ludwig. Studien über das Gleichgewicht der lebenden Kraft. Wissenschaftliche Abhandlungen, 1:49–96, 1868.

Boser, Bernhard E, Guyon, Isabelle M, and Vapnik, Vladimir N. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152, 1992.

Brush, Stephen G. History of the Lenz-Ising model. Reviews of Modern Physics, 39(4):883, 1967.

Carroll, Sean. From Eternity to Here: The Quest for the Ultimate Theory of Time. Penguin, 2010.

Dayan, Peter, Hinton, Geoffrey E, Neal, Radford M, and Zemel, Richard S. The Helmholtz machine. Neural Computation, 7(5):889–904, 1995.

Dwivedi, Raaz, Chen, Yuansi, Wainwright, Martin J, and Yu, Bin. Log-concave sampling: Metropolis-Hastings algorithms are fast! In Conference on Learning Theory, pp. 793–797. PMLR, 2018.

Hinton, Geoffrey E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

Hinton, Geoffrey E. Boltzmann machine. Scholarpedia, 2(5):1668, 2007.

Hinton, Geoffrey E. Deep belief networks. Scholarpedia, 4(5):5947, 2009.

Hinton, Geoffrey E. A practical guide to training restricted Boltzmann machines. In Neural Networks: Tricks of the Trade, pp. 599–619. Springer, 2012.

Hinton, Geoffrey E and Salakhutdinov, Ruslan R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

Hinton, Geoffrey E and Salakhutdinov, Russ R. A better way to pretrain deep Boltzmann machines. Advances in Neural Information Processing Systems, 25:2447–2455, 2012.

Hinton, Geoffrey E and Sejnowski, Terrence J. Optimal perceptual inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 448. IEEE, 1983.

Hinton, Geoffrey E, Osindero, Simon, and Teh, Yee-Whye. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

Hopfield, John J. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982.

Huang, Kerson. Statistical Mechanics. John Wiley & Sons, 1987.

Ising, Ernst. Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik, 31(1):253–258, 1925.

Karpathy, Andrej and Fei-Fei, Li. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137, 2015.

Kirkpatrick, Scott, Gelatt, C Daniel, and Vecchi, Mario P. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.

Koller, Daphne and Friedman, Nir. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

Krizhevsky, Alex and Hinton, Geoff. Convolutional deep belief networks on CIFAR-10. Unpublished manuscript, 40(7):1–9, 2010.

LeCun, Yann, Chopra, Sumit, Hadsell, Raia, Ranzato, M, and Huang, F. A tutorial on energy-based learning. Predicting Structured Data, 1, 2006.

LeCun, Yann, Bengio, Yoshua, and Hinton, Geoffrey. Deep learning. Nature, 521(7553):436–444, 2015.

Lenz, Wilhelm. Beiträge zum Verständnis der magnetischen Eigenschaften in festen Körpern. Physikalische Zeitschrift, 21:613–615, 1920.

Little, William A. The existence of persistent states in the brain. Mathematical Biosciences, 19(1-2):101–120, 1974.

Melchior, Jan, Fischer, Asja, and Wiskott, Laurenz. How to center deep Boltzmann machines. The Journal of Machine Learning Research, 17(1):3387–3447, 2016.

Mohamed, Abdel-rahman and Hinton, Geoffrey. Phone recognition using restricted Boltzmann machines. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4354–4357. IEEE, 2010.

Mohamed, Abdel-rahman, Dahl, George, Hinton, Geoffrey, et al. Deep belief networks for phone recognition. In NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, volume 1, pp. 39. Vancouver, Canada, 2009.

Mohamed, Abdel-rahman, Dahl, George E, and Hinton, Geoffrey. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):14–22, 2011.

Molkaraie, Mehdi. The primal versus the dual Ising model. In 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 53–60. IEEE, 2017.

Molkaraie, Mehdi. Marginal densities, factor graph duality, and high-temperature series expansions. In International Conference on Artificial Intelligence and Statistics, pp. 256–265, 2020.

Montavon, Grégoire and Müller, Klaus-Robert. Deep Boltzmann machines and the centering trick. In Neural Networks: Tricks of the Trade, pp. 621–637. Springer, 2012.

Nhan Duong, Chi, Luu, Khoa, Gia Quach, Kha, and Bui, Tien D. Beyond principal components: Deep Boltzmann machines for face modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4786–4794, 2015.

Passos, Leandro Aparecido and Papa, João Paulo. Temperature-based deep Boltzmann machines. Neural Processing Letters, 48(1):95–107, 2018.

Ramsauer, Hubert, Schäfl, Bernhard, Lehner, Johannes, Seidl, Philipp, Widrich, Michael, Adler, Thomas, Gruber, Lukas, Holzleitner, Markus, Pavlović, Milena, Sandve, Geir Kjetil, et al. Hopfield networks is all you need. arXiv preprint arXiv:2008.02217, 2020.

Rumelhart, David E, Hinton, Geoffrey E, and Williams, Ronald J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

Salakhutdinov, Ruslan. Learning deep Boltzmann machines using adaptive MCMC. In Proceedings of the 27th International Conference on Machine Learning, pp. 943–950, 2010.

Salakhutdinov, Ruslan and Hinton, Geoffrey. Deep Boltzmann machines. In Artificial Intelligence and Statistics, pp. 448–455. PMLR, 2009a.

Salakhutdinov, Ruslan and Hinton, Geoffrey. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978, 2009b.

Salakhutdinov, Ruslan and Hinton, Geoffrey. An efficient learning procedure for deep Boltzmann machines. Neural Computation, 24(8):1967–2006, 2012.

Salakhutdinov, Ruslan and Larochelle, Hugo. Efficient learning of deep Boltzmann machines. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 693–700. JMLR Workshop and Conference Proceedings, 2010.

Srivastava, Nitish and Salakhutdinov, Ruslan. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems, volume 1, pp. 2, 2012.

Srivastava, Nitish and Salakhutdinov, Ruslan. Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research, 15(1):2949–2980, 2014.

Srivastava, Nitish, Salakhutdinov, Ruslan R, and Hinton, Geoffrey E. Modeling documents with deep Boltzmann machines. arXiv preprint arXiv:1309.6865, 2013.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Sutskever, Ilya, Hinton, Geoffrey E, and Taylor, Graham W. The recurrent temporal restricted Boltzmann machine. In Advances in Neural Information Processing Systems, pp. 1601–1608, 2009.

Taylor, Graham W, Hinton, Geoffrey E, and Roweis, Sam T. Modeling human motion using binary latent variables. In Advances in Neural Information Processing Systems, pp. 1345–1352, 2007.

Van Der Maaten, Laurens. Learning a parametric embedding by preserving local structure. In Artificial Intelligence and Statistics, pp. 384–391, 2009.

Welling, Max, Rosen-Zvi, Michal, and Hinton, Geoffrey E. Exponential family harmoniums with an application to information retrieval. In Advances in Neural Information Processing Systems, volume 4, pp. 1481–1488, 2004.