Restricted Boltzmann Machine and Deep Belief Network: Tutorial and Survey
To appear as a part of an upcoming textbook on dimensionality reduction and manifold learning.

arXiv:2107.12521v1 [cs.LG] 26 Jul 2021

Benyamin Ghojogh
Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Ali Ghodsi
Department of Statistics and Actuarial Science & David R. Cheriton School of Computer Science, Data Analytics Laboratory, University of Waterloo, Waterloo, ON, Canada

Fakhri Karray
Department of Electrical and Computer Engineering, Centre for Pattern Analysis and Machine Intelligence, University of Waterloo, Waterloo, ON, Canada

Mark Crowley
Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Abstract

This is a tutorial and survey paper on Boltzmann Machine (BM), Restricted Boltzmann Machine (RBM), and Deep Belief Network (DBN). We start with the required background on probabilistic graphical models, Markov random field, Gibbs sampling, statistical physics, the Ising model, and the Hopfield network. Then, we introduce the structures of BM and RBM. The conditional distributions of visible and hidden variables, Gibbs sampling in RBM for generating variables, training BM and RBM by maximum likelihood estimation, and contrastive divergence are explained. Then, we discuss different possible discrete and continuous distributions for the variables. We introduce conditional RBM and how it is trained. Finally, we explain the deep belief network as a stack of RBM models. This paper on Boltzmann machines can be useful in various fields including data science, statistics, neural computation, and statistical physics.

1. Introduction

Centuries ago, the Boltzmann distribution (Boltzmann, 1868), also called the Gibbs distribution (Gibbs, 1902), was proposed. This energy-based distribution was found to be useful for modeling physical systems statistically (Huang, 1987). One of these systems was the Ising model, which modeled interacting particles with binary spins (Lenz, 1920; Ising, 1925). Later, the Ising model was found to be able to act as a neural network (Little, 1974). Hence, the Hopfield network was proposed, which modeled an Ising model in a network for modeling memory (Hopfield, 1982). Inspired by the Hopfield network (Little, 1974; Hopfield, 1982), which was itself inspired by the physical Ising model (Lenz, 1920; Ising, 1925), Hinton et al. proposed the Boltzmann Machine (BM) and the Restricted Boltzmann Machine (RBM) (Hinton & Sejnowski, 1983; Ackley et al., 1985). These models are energy-based models (LeCun et al., 2006), and their names come from the Boltzmann distribution (Boltzmann, 1868; Gibbs, 1902) used in these models. A BM has weighted links between its two layers of neurons as well as links between the neurons within each layer. RBM restricts this structure by removing the links between the neurons of a layer. BM and RBM take one of the layers as the layer of data and the other layer as a representation or embedding of the data. BM and RBM are special cases of the Ising model whose weights (coupling parameters) are learned. BM and RBM are also special cases of the Hopfield network whose weights are learned by maximum likelihood estimation rather than the Hebbian learning method (Hebb, 1949) which is used in the Hopfield network.
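To make this difference in connectivity concrete, the following is a minimal sketch of our own (not taken from the paper), using hypothetical layer sizes and variable names, of which pairs of units carry weighted links in a BM versus an RBM; only the links between the two layers survive the restriction.

```python
import numpy as np

# Hypothetical sizes for illustration: 4 visible and 3 hidden units.
n_v, n_h = 4, 3
n = n_v + n_h  # total number of units; indices 0..n_v-1 are visible

# Boltzmann Machine: weighted links between the two layers AND within each layer,
# i.e. every pair of distinct units may be connected.
bm_links = np.ones((n, n), dtype=bool)
np.fill_diagonal(bm_links, False)  # no self-links

# Restricted Boltzmann Machine: keep only the visible-hidden links,
# i.e. remove all links between the neurons of the same layer.
rbm_links = np.zeros((n, n), dtype=bool)
rbm_links[:n_v, n_v:] = True   # visible to hidden
rbm_links[n_v:, :n_v] = True   # hidden to visible (links are undirected, so symmetric)

print(bm_links.sum() // 2)   # 21 undirected BM links
print(rbm_links.sum() // 2)  # 12 undirected RBM links (= n_v * n_h)
```

Training, discussed later in the paper, amounts to learning the weights (coupling parameters) attached to these links.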
The Hebbian learning method, used in the Hopfield network, was very weak and could not generalize well to unseen data. Therefore, backpropagation (Rumelhart et al., 1986) was proposed for training neural networks. Backpropagation was gradient descent combined with the chain rule. However, researchers soon found that neural networks could not become deep in their number of layers. This is because, in deep networks, gradients become very small in the initial layers after many applications of the chain rule from the last layers of the network. This problem was called vanishing gradients. This problem, together with the success of the theory behind kernel support vector machines (Boser et al., 1992), resulted in the winter of neural networks, lasting from the last years of the previous century until around 2006.

During the winter of neural networks, Hinton tried to save neural networks from being forgotten in the history of machine learning. So, he returned to his previously proposed RBM and proposed a learning method for RBM with the help of some other researchers, including Max Welling (Hinton, 2002; Welling et al., 2004). They proposed training the weights of BM and RBM using maximum likelihood estimation. BM and RBM can be seen as generative models where new values for neurons can be generated using Gibbs sampling (Geman & Geman, 1984). Hinton noticed RBM because he knew that the set of weights between every two layers of a neural network is an RBM. It was in the year 2006 (Hinton & Salakhutdinov, 2006; Hinton et al., 2006) that he realized it is possible to train a network in a greedy way (Bengio et al., 2007), where the weights of every layer of the network are trained using RBM training. This stack of RBM models with a greedy training algorithm was named the Deep Belief Network (DBN) (Hinton et al., 2006; Hinton, 2009). DBN allowed networks to become deep by preparing a good initialization of weights (using RBM training) for backpropagation. This good starting point for backpropagation optimization did not face the problem of vanishing gradients anymore. Since the breakthrough in 2006 (Hinton & Salakhutdinov, 2006), the winter of neural networks gradually started to end, because networks could become deep, and therefore more nonlinear, and could handle more nonlinear data.

DBN was used in different applications including speech recognition (Mohamed et al., 2009; Mohamed & Hinton, 2010; Mohamed et al., 2011) and action recognition (Taylor et al., 2007). Hinton was very excited about the success of RBM and thought that the future of neural networks belonged to DBN. However, his research group proposed two important techniques, namely the ReLU activation function (Glorot et al., 2011) and the dropout technique (Srivastava et al., 2014). These two regularization methods prevented overfitting (Ghojogh & Crowley, 2019) and resolved vanishing gradients even without RBM pre-training. Hence, backpropagation could be used alone if the new regularization methods were utilized. The success of neural networks was demonstrated further (LeCun et al., 2015) by their various applications, for example in image recognition (Karpathy & Fei-Fei, 2015).

This is a tutorial and survey paper on BM, RBM, and DBN. The remainder of this paper is as follows. We briefly review the required background on probabilistic graphical models, Markov random field, Gibbs sampling, statistical physics, the Ising model, and the Hopfield network in Section 2. The structure of BM and RBM, Gibbs sampling in RBM, training RBM by maximum likelihood estimation, contrastive divergence, and training BM are explained in Section 3. Then, we introduce different cases of states for units in RBM in Section 4. Conditional RBM and DBN are explained in Sections 5 and 6, respectively. Finally, Section 7 concludes the paper.

Required Background for the Reader

This paper assumes that the reader has general knowledge of calculus, probability, linear algebra, and basics of optimization. The required background on statistical physics is explained in the paper.

2. Background

2.1. Probabilistic Graphical Model and Markov Random Field

A Probabilistic Graphical Model (PGM) is a graph-based representation of a complex distribution in a possibly high dimensional space (Koller & Friedman, 2009). In other words, PGM is a combination of graph theory and probability theory. In a PGM, the random variables are represented by nodes or vertices. An edge exists between two variables which have interaction with one another in terms of probability. Different conditional probabilities can be represented by a PGM. There exist two types of PGM, which are the Markov network (also called Markov random field) and the Bayesian network (Koller & Friedman, 2009). In the Markov network and the Bayesian network, the edges of the graph are undirected and directed, respectively. BM and RBM are Markov networks (Markov random fields) because their links are undirected (Hinton, 2007).

2.2. Gibbs Sampling

Gibbs sampling, first proposed in (Geman & Geman, 1984), draws samples from a d-dimensional multivariate distribution $\mathbb{P}(X)$ using d conditional distributions (Bishop, 2006). This sampling algorithm assumes that the conditional distributions of every dimension of data, conditioned on the rest of the coordinates, are simple to draw samples from.

In Gibbs sampling, we desire to sample from a multivariate distribution $\mathbb{P}(X)$ where $X \in \mathbb{R}^d$. Consider the notation $\mathbb{R}^d \ni \mathbf{x} := [x_1, x_2, \dots, x_d]^\top$. We start from a random d-dimensional vector in the range of the data. Then, we sample the first dimension of the first sample from the distribution of the first dimension conditioned on the other dimensions. We do this for all dimensions, where the j-th dimension is sampled as (Ghojogh et al., 2020):

$$x_j \sim \mathbb{P}(x_j \,|\, x_1, \dots, x_{j-1}, x_{j+1}, \dots, x_d). \quad (1)$$

We do this for all dimensions until all dimensions of the first sample are drawn. Then, starting from the first sample, we repeat this procedure for the dimensions of the second sample. We iteratively perform this for all samples; however, some initial samples are not yet valid because the algorithm has started from a not-necessarily valid vector. We accept all samples after some burn-in iterations.
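As an illustration of this procedure, here is a minimal sketch of a generic Gibbs sampler in the sense of Eq. (1). The target distribution (a two-dimensional Gaussian with correlation $\rho$, whose conditionals are known in closed form) and all function and variable names are our own choices for illustration, not taken from the paper.

```python
import numpy as np

def gibbs_sampling(conditional_samplers, x0, n_samples, burn_in=100):
    """Generic Gibbs sampler following Eq. (1): repeatedly redraw the j-th
    coordinate from its distribution conditioned on all other coordinates.
    conditional_samplers[j](x) must return a draw of x_j given the rest."""
    x = np.array(x0, dtype=float)
    samples = []
    for it in range(n_samples + burn_in):
        for j in range(len(x)):
            x[j] = conditional_samplers[j](x)
        if it >= burn_in:          # accept samples only after burn-in iterations
            samples.append(x.copy())
    return np.array(samples)

# Illustrative target (our choice): a 2D Gaussian with correlation rho,
# whose conditionals are the well-known one-dimensional Gaussians.
rho = 0.8
rng = np.random.default_rng(0)
conditionals = [
    lambda x: rng.normal(rho * x[1], np.sqrt(1 - rho ** 2)),  # x_1 | x_2
    lambda x: rng.normal(rho * x[0], np.sqrt(1 - rho ** 2)),  # x_2 | x_1
]

samples = gibbs_sampling(conditionals, x0=[0.0, 0.0], n_samples=5000)
print(np.corrcoef(samples.T)[0, 1])  # should be close to rho = 0.8
```

With enough iterations after the burn-in phase, the empirical correlation of the accepted samples approaches the target correlation, which is a simple sanity check that the sampler targets the intended distribution.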
2.3. Statistical Physics

The free energy is defined as $F(\beta) := \frac{-1}{\beta} \ln(Z)$, where $\ln(\cdot)$ is the natural logarithm and $Z$ is the partition function. The internal energy is defined as:

$$U(\beta) := \frac{\partial}{\partial \beta} \big(\beta\, F(\beta)\big). \quad (6)$$

Therefore, we have:

$$U(\beta) = \frac{\partial}{\partial \beta} \big(-\ln(Z)\big) = \frac{-1}{Z} \frac{\partial Z}{\partial \beta} \overset{(3)}{=} \sum_x \frac{e^{-\beta E(x)}}{Z}\, E(x) \overset{(2)}{=} \sum_x \mathbb{P}(x)\, E(x). \quad (7)$$
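To make Eq. (7) concrete, the following is a short numerical check of our own, using a hypothetical five-state system with made-up energies: the expected energy under the Boltzmann distribution, $\sum_x \mathbb{P}(x) E(x)$, agrees with the derivative form $\frac{\partial}{\partial \beta}(-\ln Z)$ evaluated by a finite difference.

```python
import numpy as np

# A toy discrete system (our choice, for illustration): five states with
# hypothetical energies E(x).
E = np.array([0.0, 0.5, 1.0, 1.5, 3.0])
beta = 1.2

def log_Z(b):
    """Log of the partition function Z(b) = sum_x exp(-b * E(x))."""
    return np.log(np.sum(np.exp(-b * E)))

# Internal energy as the expected energy under the Boltzmann distribution,
# U(beta) = sum_x P(x) E(x), as in the right-hand side of Eq. (7).
P = np.exp(-beta * E) / np.exp(log_Z(beta))
U_expected = np.sum(P * E)

# Internal energy from U(beta) = d/d beta (beta F(beta)) = d/d beta (-ln Z),
# evaluated here by a central finite difference.
eps = 1e-6
U_derivative = -(log_Z(beta + eps) - log_Z(beta - eps)) / (2 * eps)

print(U_expected, U_derivative)  # the two values should agree closely
```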