Boltzmann Machine

A.L. Yuille. JHU. 2016.


1. Introduction

The Boltzmann Machine (Hinton and Sejnowski) is a method for learning the weights of a probability distribution, assuming that a subset of the nodes (the input nodes) are observed and the remainder are hidden.

Gibbs Distribution

The probability of a system of $N$ neurons $\vec{S} = (S_1, \dots, S_N)$, where each $S_i$ takes value 0 or 1, is defined by a Gibbs distribution with energy $E(\vec{S}) = -\frac{1}{2}\sum_{ij} \omega_{ij} S_i S_j$ and distribution:

$$P(\vec{S}) = \frac{1}{Z} \exp\{-E(\vec{S})/T\}. \quad (1)$$

States/configurations $\vec{S}$ with low energy $E(\vec{S})$ correspond to high probability $P(\vec{S})$. This depends on the parameter $T \geq 0$. When Gibbs distributions are used in statistical physics, $T$ corresponds to the temperature of the system. In this course, $T$ is a parameter that characterizes the degree of uncertainty of the distribution. For very small $T$, $T \approx 0$, the distribution is strongly peaked at those configurations $\vec{S}$ that minimize the energy. For large $T$, the distribution becomes less peaked (although the relative ordering of which states are more probable does not change as $T$ varies). The value of $T$ is not important for Boltzmann Machines, but it is for other applications (see the book chapter handout for the previous lecture, which describes how it is used for annealing).

The term $Z$ is a normalization constant defined so that $\sum_{\vec{S}} P(\vec{S}) = 1$. Hence $Z = \sum_{\vec{S}} \exp\{-E(\vec{S})/T\}$. It is important to realize that it is often impossible to compute $Z$ because it requires summing over an exponential number of configurations (each neuron $S_i$ takes two values and $\vec{S}$ contains $N$ neurons, so $\vec{S}$ takes $2^N$ possible values). This means that although we can compute $E(\vec{S})$ we cannot compute $P(\vec{S})$. This makes computations involving these distributions difficult, e.g., computing $\sum_{\vec{S}} \vec{S} P(\vec{S})$, and is a reason why Markov Chain Monte Carlo (MCMC) sampling was invented. It enables us to obtain samples $\vec{S}^1, \dots, \vec{S}^m$ from $P(\vec{S})$ and hence approximate $\sum_{\vec{S}} \vec{S} P(\vec{S})$ by $\frac{1}{m}\sum_{i=1}^{m} \vec{S}^i$. This will be used by Boltzmann Machines.

Note that $Z$ is a function of the weights $\omega_{ij}$ of the energy function $E(\vec{S})$. This has important mathematical implications, such as $\partial \log Z / \partial \omega_{ij} = \frac{1}{T}\sum_{\vec{S}} S_i S_j P(\vec{S})$, which arise in the derivation of the Boltzmann Machine. These results are much more general than Boltzmann Machines and appear frequently in other areas.
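To make these definitions concrete, here is a minimal sketch in Python/NumPy (assuming an illustrative toy network of $N = 10$ neurons with random symmetric weights) that evaluates $E(\vec{S})$ and, because $N$ is tiny, computes $Z$ and $P(\vec{S})$ by brute-force enumeration of all $2^N$ configurations; for realistic $N$ this enumeration is exactly what becomes impossible:

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    N, T = 10, 1.0                        # toy size and "temperature"
    W = rng.normal(scale=0.3, size=(N, N))
    W = (W + W.T) / 2                     # symmetric weights, w_ij = w_ji
    np.fill_diagonal(W, 0.0)              # no self-connections

    def energy(S, W):
        """E(S) = -1/2 * sum_ij w_ij S_i S_j for a binary state vector S."""
        return -0.5 * S @ W @ S

    # Partition function Z = sum_S exp(-E(S)/T), by brute force over all 2^N states.
    states = np.array(list(itertools.product([0, 1], repeat=N)))
    Z = sum(np.exp(-energy(S, W) / T) for S in states)

    def prob(S, W):
        """P(S) = exp(-E(S)/T) / Z; in general Z is intractable."""
        return np.exp(-energy(S, W) / T) / Z

    S = rng.integers(0, 2, size=N)        # an arbitrary configuration
    print(energy(S, W), prob(S, W))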

2. Gibbs Distribution for the Boltzmann Machine

We divide the nodes into two classes, $V_o$ and $V_h$, which are the observed (input) and hidden nodes respectively. We use $\vec{S}_o$ and $\vec{S}_h$ to denote the states of the observed and hidden nodes, with components $\{S_i : i \in V_o\}$ and $\{S_i : i \in V_h\}$ respectively. Hence $\vec{S} = (\vec{S}_o, \vec{S}_h)$, and we can re-express the distribution over the states as:

$$P(\vec{S}_o, \vec{S}_h) = \frac{1}{Z} \exp\{-E(\vec{S})/T\}. \quad (2)$$

The marginal distribution over the observed nodes is

$$P(\vec{S}_o) = \sum_{\vec{S}_h} \frac{1}{Z} \exp\{-E(\vec{S})/T\}. \quad (3)$$
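As an illustration of equation (3), the following sketch (under toy assumptions: 3 observed and 2 hidden binary units with random symmetric weights, all names illustrative) computes the marginal $P(\vec{S}_o)$ by explicitly summing over the hidden configurations, which is only feasible when $|V_h|$ is small:

    import itertools
    import numpy as np

    rng = np.random.default_rng(1)
    n_o, n_h, T = 3, 2, 1.0               # toy numbers of observed and hidden units
    n = n_o + n_h                         # units ordered as (observed, hidden)
    W = rng.normal(scale=0.5, size=(n, n))
    W = (W + W.T) / 2
    np.fill_diagonal(W, 0.0)

    # Joint distribution P(S_o, S_h) over all 2^n configurations (equation (2)).
    states = np.array(list(itertools.product([0, 1], repeat=n)))
    unnorm = np.array([np.exp(0.5 * S @ W @ S / T) for S in states])  # exp(-E/T) with E = -1/2 S'WS
    P_joint = unnorm / unnorm.sum()

    def marginal(S_o):
        """Equation (3): sum P(S_o, S_h) over joint states whose observed part equals S_o."""
        mask = np.all(states[:, :n_o] == S_o, axis=1)
        return P_joint[mask].sum()

    print(marginal(np.array([1, 0, 1])))
    # The marginals over all observed patterns sum to one:
    print(sum(marginal(np.array(p)) for p in itertools.product([0, 1], repeat=n_o)))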

We assume that we can estimate a distribution $R(\vec{S}_o)$ of the observed nodes (see more in a later section). The goal of learning is to adjust the weights $\vec{\omega}$ of the model (i.e. the $\{\omega_{ij}\}$) so that the marginal distribution $P(\vec{S}_o)$ of the model is as similar as possible to the observed distribution $R(\vec{S}_o)$. This requires specifying a similarity criterion, which is chosen to be the Kullback-Leibler divergence:

$$KL(\vec{\omega}) = \sum_{\vec{S}_o} R(\vec{S}_o) \log \frac{R(\vec{S}_o)}{P(\vec{S}_o)}. \quad (4)$$

We will discuss in a later section how this relates to the standard maximum likelihood criterion for learning distributions (it is effectively the same). The Boltzmann Machine adjusts the weights by the iterative update rule:

$$\omega_{ij} \mapsto \omega_{ij} + \Delta\omega_{ij} \quad (5)$$

$$\Delta\omega_{ij} = -\delta \frac{\partial KL(\vec{\omega})}{\partial \omega_{ij}} \quad (6)$$

$$\Delta\omega_{ij} = \frac{\delta}{T} \left\{ \langle S_i S_j \rangle_{clamped} - \langle S_i S_j \rangle \right\} \quad (7)$$

Here $\delta$ is a small positive constant. The derivation of the update rule is given in a later section; the way to compute it is described in the next section. $\langle S_i S_j \rangle_{clamped}$ and $\langle S_i S_j \rangle$ are the expectations (i.e., correlations) of the state variables $S_i, S_j$ when the data is generated by the clamped distribution $R(\vec{S}_o) P(\vec{S}_h|\vec{S}_o)$ and by the distribution $P(\vec{S}_o, \vec{S}_h)$ respectively; i.e. $\langle S_i S_j \rangle = \sum_{\vec{S}} S_i S_j P(\vec{S})$. The conditional distribution $P(\vec{S}_h|\vec{S}_o)$ is the distribution over the hidden states conditioned on the observed states, so it is given by $P(\vec{S}_h|\vec{S}_o) = P(\vec{S}_h, \vec{S}_o)/P(\vec{S}_o)$.

The update rule, equation (7), has two components. The first term $\langle S_i S_j \rangle_{clamped}$ is Hebbian and the second term $\langle S_i S_j \rangle$ is anti-Hebbian (because of the minus sign). This is a balance between the activity of the model when it is driven by input data (i.e. clamped) and when it is driven by itself. A wild speculation is that the Hebbian learning is done when you are awake, and hence exposed to external stimuli, while the anti-Hebbian learning is done when you are asleep with your eyes shut but, by sampling from $P(\vec{S}_o|\vec{S}_h)$, you are creating images, or dreaming. The algorithm converges when the model accurately fits the data, i.e. when $\langle S_i S_j \rangle_{clamped} = \langle S_i S_j \rangle$ and the right hand side of the update rule, equation (7), is zero.

What is the observed distribution $R(\vec{S}_o)$? We do not know $R(\vec{S}_o)$ exactly and so we approximate it by the training data $\{\vec{S}_o^{\mu} : \mu = 1, \dots, N\}$. This is equivalent to assuming that

$$R(\vec{S}_o) = \frac{1}{N} \sum_{\mu=1}^{N} \delta(\vec{S}_o - \vec{S}_o^{\mu}). \quad (8)$$

(We return to this later when we show the relationship to maximum likelihood learning.)
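A minimal sketch of one weight update, equations (5)-(7), assuming the two correlation matrices have already been estimated (how to estimate them with Gibbs sampling is the topic of the next subsection); the function name, step size, and matrix sizes are illustrative placeholders:

    import numpy as np

    def bm_update(W, corr_clamped, corr_free, delta=0.01, T=1.0):
        """Equation (7): w_ij <- w_ij + (delta/T) * (<S_i S_j>_clamped - <S_i S_j>)."""
        dW = (delta / T) * (corr_clamped - corr_free)
        np.fill_diagonal(dW, 0.0)          # keep the convention of no self-connections
        return W + dW

    # Illustrative usage with placeholder correlation estimates (normally these
    # come from sampling, as described in the next subsection).
    rng = np.random.default_rng(0)
    N = 5
    corr_clamped = rng.uniform(size=(N, N))
    corr_clamped = (corr_clamped + corr_clamped.T) / 2
    corr_free = rng.uniform(size=(N, N))
    corr_free = (corr_free + corr_free.T) / 2
    W = bm_update(np.zeros((N, N)), corr_clamped, corr_free)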

2.1. Estimating the $\langle S_i S_j \rangle$

The main difficulty of the learning rule for the Boltzmann Machine is how to compute $\langle S_i S_j \rangle_{clamped}$ and $\langle S_i S_j \rangle$. To do this, it is natural to use Gibbs sampling. Recall from earlier lectures that the update rule for neurons performs Gibbs sampling, i.e. selecting a neuron $i$ at random and then sampling $S_i$ from the conditional distribution $P(S_i|\vec{S}_{/i})$.

By performing Gibbs sampling multiple times on the distribution $P(\vec{S}_o, \vec{S}_h)$ we obtain $M$ samples $\vec{S}^1, \dots, \vec{S}^M$. Then we can approximate $\langle S_i S_j \rangle$ by:

$$\langle S_i S_j \rangle \approx \frac{1}{M} \sum_{a=1}^{M} S_i^a S_j^a. \quad (9)$$

Similarly, we can obtain samples from $R(\vec{S}_o) P(\vec{S}_h|\vec{S}_o)$ (the clamped case) by first generating samples $\vec{S}_o^1, \dots, \vec{S}_o^M$ from $R(\vec{S}_o)$ and then converting them to samples

$$\vec{S}^1, \dots, \vec{S}^M, \quad (10)$$

where $\vec{S}^i = (\vec{S}_o^i, \vec{S}_h^i)$, and $\vec{S}_h^i$ is a random sample from $P(\vec{S}_h|\vec{S}_o^i)$, again obtained by Gibbs sampling.

How do we sample from $R(\vec{S}_o)$? Recall that we only know the samples $\{\vec{S}_o^{\mu} : \mu = 1, \dots, N\}$ (the training data). Hence sampling from $R(\vec{S}_o)$ reduces to selecting one of the training examples at random.

The fact that the Boltzmann Machine uses Gibbs sampling is a big limitation. If the model is complicated, i.e. there are many hidden nodes and weights $\omega_{ij}$, then Gibbs sampling can take a long time to converge. This means that calculating the learning rule, equation (7), becomes impractical. We can approximate the expectations $\langle S_i S_j \rangle$ by equations (9,10), but these approximations are sometimes very bad. This means that Boltzmann Machines are of limited effectiveness. In a later section we discuss Restricted Boltzmann Machines (RBMs), for which we can perform efficient sampling and hence estimate $\langle S_i S_j \rangle$ and $\langle S_i S_j \rangle_{clamped}$ effectively. RBMs are used as components to build one type of deep neural network. Note that some accounts of Boltzmann Machines say that the BM has to run until it reaches thermal equilibrium. This is equivalent to saying that Gibbs sampling yields samples from $P(\vec{S}_o, \vec{S}_h)$ (and from $P(\vec{S}_h|\vec{S}_o)$).
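The following sketch shows how the two expectations might be estimated with Gibbs sampling on a toy model (small random symmetric weights, observed units listed first, a two-pattern stand-in training set); the burn-in and sample counts are arbitrary illustrative choices, not recommendations:

    import numpy as np

    rng = np.random.default_rng(0)
    n_o, n_h, T = 3, 2, 1.0
    n = n_o + n_h
    W = rng.normal(scale=0.5, size=(n, n))
    W = (W + W.T) / 2
    np.fill_diagonal(W, 0.0)
    data = np.array([[1, 0, 1], [0, 1, 1]])          # stand-in training patterns

    def gibbs_step(S, W, free_idx):
        """Resample each unit in free_idx from P(S_i | S_{/i}); the rest stay clamped."""
        for i in free_idx:
            field = W[i] @ S                         # diagonal is zero, so no self term
            p1 = 1.0 / (1.0 + np.exp(-field / T))    # P(S_i = 1 | rest)
            S[i] = rng.random() < p1
        return S

    def estimate_corr(S_init, free_idx, n_burn=200, n_samples=500):
        S = S_init.copy()
        corr = np.zeros((n, n))
        for t in range(n_burn + n_samples):
            S = gibbs_step(S, W, free_idx)
            if t >= n_burn:
                corr += np.outer(S, S)
        return corr / n_samples

    # Free phase: all units are resampled.
    corr_free = estimate_corr(rng.integers(0, 2, n).astype(float), range(n))

    # Clamped phase: clamp the observed units to a training pattern, resample
    # only the hidden units, and average over the patterns.
    corr_clamped = np.zeros((n, n))
    for pattern in data:
        S0 = np.concatenate([pattern, rng.integers(0, 2, n_h)]).astype(float)
        corr_clamped += estimate_corr(S0, range(n_o, n)) / len(data)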

3. Derivation of the BM update rule

To justify the learning rule, equation (7), we need to take the derivative of the cost function, $\partial KL(\vec{\omega})/\partial \omega_{ij}$:

$$\frac{\partial KL(\vec{\omega})}{\partial \omega_{ij}} = -\sum_{\vec{S}_o} \frac{R(\vec{S}_o)}{P(\vec{S}_o)} \frac{\partial P(\vec{S}_o)}{\partial \omega_{ij}}. \quad (11)$$

Expressing $P(\vec{S}_o) = \frac{1}{Z} \sum_{\vec{S}_h} \exp\{-E(\vec{S})/T\}$, we can express $\frac{\partial P(\vec{S}_o)}{\partial \omega_{ij}}$ in two terms:

$$\frac{1}{Z} \frac{\partial}{\partial \omega_{ij}} \sum_{\vec{S}_h} \exp\{-E(\vec{S})/T\} \;-\; \frac{1}{Z} \sum_{\vec{S}_h} \exp\{-E(\vec{S})/T\} \frac{\partial \log Z}{\partial \omega_{ij}}, \quad (12)$$

which can be re-expressed as:

$$\frac{1}{T} \sum_{\vec{S}_h} S_i S_j P(\vec{S}) \;-\; \left\{ \sum_{\vec{S}_h} P(\vec{S}) \right\} \frac{1}{T} \sum_{\vec{S}} S_i S_j P(\vec{S}). \quad (13)$$

Hence we can compute:

$$\frac{\partial P(\vec{S}_o)}{\partial \omega_{ij}} = \frac{1}{T} \sum_{\vec{S}_h} S_i S_j P(\vec{S}) \;-\; \frac{1}{T} P(\vec{S}_o) \sum_{\vec{S}} S_i S_j P(\vec{S}). \quad (14)$$

Substituting equation (14) into equation (11) yields

$$\frac{\partial KL(\vec{\omega})}{\partial \omega_{ij}} = -\frac{1}{T} \sum_{\vec{S}_h, \vec{S}_o} S_i S_j \frac{P(\vec{S})}{P(\vec{S}_o)} R(\vec{S}_o) \;+\; \frac{1}{T} \left\{ \sum_{\vec{S}_o} R(\vec{S}_o) \right\} \sum_{\vec{S}} S_i S_j P(\vec{S}), \quad (15)$$

which can be simplified to give:

$$\frac{\partial KL(\vec{\omega})}{\partial \omega_{ij}} = -\frac{1}{T} \sum_{\vec{S}} S_i S_j P(\vec{S}_h|\vec{S}_o) R(\vec{S}_o) \;+\; \frac{1}{T} \sum_{\vec{S}} S_i S_j P(\vec{S}). \quad (16)$$

Note this derivation uses $\partial \log Z / \partial \omega_{ij} = \frac{1}{T} \sum_{\vec{S}} S_i S_j P(\vec{S})$.
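As a sanity check on the derivation, this sketch compares the analytic gradient of equation (16) with a finite-difference derivative of $KL(\vec{\omega})$ on a tiny model where everything can be enumerated (the toy sizes, random weights, and two-pattern stand-in for $R(\vec{S}_o)$ are all assumptions for illustration):

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    n_o, n_h, T = 3, 2, 1.0
    n = n_o + n_h
    W = rng.normal(scale=0.5, size=(n, n))
    W = (W + W.T) / 2
    np.fill_diagonal(W, 0.0)
    data = np.array([[1, 0, 1], [0, 1, 1]])                  # stand-in for R(S_o)
    states = np.array(list(itertools.product([0, 1], repeat=n)))

    def model_probs(W):
        E = np.array([-0.5 * S @ W @ S for S in states])
        p = np.exp(-E / T)
        return p / p.sum()

    def kl(W):
        """KL(R || P(S_o)) with R the empirical distribution over the data patterns."""
        p, R = model_probs(W), 1.0 / len(data)
        total = 0.0
        for pattern in data:
            mask = np.all(states[:, :n_o] == pattern, axis=1)
            total += R * np.log(R / p[mask].sum())
        return total

    def analytic_grad(W, i, j):
        """Equation (16): dKL/dw_ij = -(1/T) * (<S_i S_j>_clamped - <S_i S_j>)."""
        p = model_probs(W)
        free = np.sum(p * states[:, i] * states[:, j])
        clamped = 0.0
        for pattern in data:
            mask = np.all(states[:, :n_o] == pattern, axis=1)
            cond = p[mask] / p[mask].sum()                   # P(S_h | S_o = pattern)
            clamped += np.sum(cond * states[mask, i] * states[mask, j]) / len(data)
        return -(clamped - free) / T

    # Finite-difference check for one weight (perturb w_ij and w_ji together).
    i, j, eps = 0, 4, 1e-5
    Wp, Wm = W.copy(), W.copy()
    Wp[i, j] += eps; Wp[j, i] += eps
    Wm[i, j] -= eps; Wm[j, i] -= eps
    print(analytic_grad(W, i, j), (kl(Wp) - kl(Wm)) / (2 * eps))  # should agree closely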

4. How does the Boltzmann Machine relate to Maximum Likelihood Learning?

They are equivalent. The Kullback-Leibler criterion, equation (4), can be expressed as

$$KL(\vec{\omega}) = \sum_{\vec{S}_o} R(\vec{S}_o) \log R(\vec{S}_o) \;-\; \sum_{\vec{S}_o} R(\vec{S}_o) \log P(\vec{S}_o). \quad (17)$$

Only the second term depends on $\vec{\omega}$, so we can ignore the first (since we want to minimize $KL(\vec{\omega})$ with respect to $\vec{\omega}$). Using the expression for $R(\vec{S}_o)$ in terms of the training data, equation (8), we can express the second term as:

$$-\frac{1}{N} \sum_{a=1}^{N} \sum_{\vec{S}_o} \delta(\vec{S}_o - \vec{S}_o^{a}) \log P(\vec{S}_o) \quad (18)$$

$$= -\frac{1}{N} \sum_{a=1}^{N} \log P(\vec{S}_o^{a}). \quad (19)$$

This is precisely the Maximum Likelihood criterion for estimating the parameters of the distribution $P(\vec{S}_o)$. More generally, estimating the parameters of a distribution by Maximum Likelihood can always be expressed in terms of minimizing a Kullback-Leibler divergence.

Why do we care? Because it gives a richer way to think of Maximum Likelihood (ML). The standard justification for ML (from Bayes Decision Theory, or in classic Statistics) is to say that it is an asymptotically consistent estimator of the model parameters $\vec{\omega}$ provided that the data has really been generated by the probability model. This is fine, but it gives no justification for using ML if we are not using the right distribution (i.e. if the data is generated by something else). The Kullback-Leibler formulation, however, says that you are learning the best distribution that your model allows ("best" in the sense of K-L divergence). So this justifies using ML even if the distribution is only an approximation to the true model that generated the data (which you probably never know).

5. Restricted Boltzmann Machines

RBMs are a special case of Boltzmann Machines in which there are no weights connecting hidden nodes to each other (and none connecting observed nodes to each other); the only weights are between observed and hidden nodes. In this case the energy can be expressed as:

$$E(\vec{S}) = -\sum_{i \in V_o,\, j \in V_h} \omega_{ij} S_i S_j. \quad (20)$$

This means that the conditional distributions $P(\vec{S}_h|\vec{S}_o)$ and $P(\vec{S}_o|\vec{S}_h)$ can both be factorized:

$$P(\vec{S}_o|\vec{S}_h) = \prod_{i \in V_o} P(S_i|\vec{S}_h), \qquad P(\vec{S}_h|\vec{S}_o) = \prod_{j \in V_h} P(S_j|\vec{S}_o), \quad (21)$$

where, for $i \in V_o$, $P(S_i|\vec{S}_h) = \frac{1}{Z_i} \exp\{(1/T) S_i (\sum_{j \in V_h} \omega_{ij} S_j)\}$ with $Z_i$ the normalization constant $Z_i = \sum_{S_i \in \{0,1\}} \exp\{(1/T) S_i (\sum_{j \in V_h} \omega_{ij} S_j)\}$, and with similar expressions for $P(S_j|\vec{S}_o)$ for $j \in V_h$.

These factorized forms mean that we can sample from $P(\vec{S}_o|\vec{S}_h)$ and $P(\vec{S}_h|\vec{S}_o)$ very rapidly (e.g., by sampling from simple distributions like $P(S_i|\vec{S}_h)$). This makes learning fast and practical. To estimate the terms $\langle S_i S_j \rangle_{clamped}$ we only need to sample from $P(\vec{S}_h|\vec{S}_o)$, which is very fast. To estimate the terms $\langle S_i S_j \rangle$, we can sample from $P(\vec{S}_o, \vec{S}_h)$ by alternately sampling from $P(\vec{S}_o|\vec{S}_h)$ and $P(\vec{S}_h|\vec{S}_o)$. This takes longer, because we have to repeat the alternation until the sampling has converged (but it is much faster than doing Gibbs sampling for full Boltzmann Machines, where there are weights connecting the hidden nodes). This means that it is practical to learn RBMs.

But this is limited because RBMs are too simple to perform useful tasks. The solution (Hinton 2006) is to learn a hierarchy of RBMs, or a deep network. The strategy is simple. You train one RBM first using the training data. Then you train a second RBM which uses the output of the first RBM as its input. Then you train a third RBM, and so on. The idea is that these higher-level RBMs compensate for the fact that there are no weights between the hidden (output) nodes of each RBM.
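Here is a minimal sketch of the factorized conditionals and the alternating (block) sampling scheme for an RBM, using the 0/1 units and the energy convention of equation (20), so the probability of a unit being on is a logistic function of its weighted input; the layer sizes, number of sweeps, and the single stand-in training pattern are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    n_o, n_h, T = 6, 4, 1.0                      # illustrative layer sizes
    W = rng.normal(scale=0.1, size=(n_o, n_h))   # only observed-to-hidden weights

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sample_hidden(S_o):
        """S_h ~ P(S_h | S_o): with E = -sum w_ij S_i S_j, each hidden unit is on
        independently with probability sigmoid((1/T) sum_i w_ij S_i)."""
        p = sigmoid((S_o @ W) / T)
        return (rng.random(n_h) < p).astype(float)

    def sample_observed(S_h):
        """S_o ~ P(S_o | S_h), likewise factorized over the observed units."""
        p = sigmoid((W @ S_h) / T)
        return (rng.random(n_o) < p).astype(float)

    # Clamped phase: sample the hidden units given a training pattern; one step
    # suffices because the conditional factorizes exactly.
    pattern = rng.integers(0, 2, n_o).astype(float)    # stand-in training example
    S_h = sample_hidden(pattern)
    corr_clamped = np.outer(pattern, S_h)

    # Free phase: alternate the two conditionals until the chain has (roughly) converged.
    S_o = pattern.copy()
    for _ in range(100):                               # arbitrary number of sweeps
        S_h = sample_hidden(S_o)
        S_o = sample_observed(S_h)
    S_h = sample_hidden(S_o)
    corr_free = np.outer(S_o, S_h)

    # Averaged over many samples and patterns, these estimates feed the same update
    # rule as before: w_ij <- w_ij + (delta/T) * (<S_i S_j>_clamped - <S_i S_j>).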