Boltzmann Machines
Sam Roweis
1 Stochastic Networks

Up till this point we have been studying associative networks that operate in a deterministic way. That is, given particular interconnections between units and a particular set of initial conditions, the networks would always exhibit the same dynamical behaviour: going downhill in energy, and hence always ending up in the same stable state. This feature of their operation was due to the fact that the rules of operation of each neuron had no probabilistic elements to their specification [1]. This feature was useful when we were interested in pattern completion and other kinds of explicit computation for which we had some problem to be solved. The problem was encoded as the initial state of our network, our stored knowledge and constraints were encoded by the connections, and the solution was the final stable state into which the network settled.
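As a concrete sketch of these deterministic dynamics, the following minimal simulation runs the usual Hopfield update rule and checks that the energy E = -1/2 Σ_ij w_ij S_i S_j never increases. The weight values here are made up purely for illustration:

```python
import numpy as np

def energy(W, S):
    """Hopfield Lyapunov function E = -1/2 * sum_ij w_ij S_i S_j."""
    return -0.5 * S @ W @ S

def update_unit(W, S, i):
    """Deterministic rule: unit i becomes active (+1) if its net input
    (sum of weights to other active units) is positive, else inactive (-1)."""
    net = W[i] @ S
    S[i] = 1 if net > 0 else -1

# Illustrative symmetric weights with zero diagonal (values are made up).
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)

S = rng.choice([-1, 1], size=6)
energies = [energy(W, S)]
for t in range(50):
    update_unit(W, S, t % 6)       # sweep repeatedly through the units
    energies.append(energy(W, S))

# With symmetric weights, each update can only lower (or keep) the energy,
# so the network slides downhill into a stable state.
assert all(e2 <= e1 + 1e-12 for e1, e2 in zip(energies, energies[1:]))
```

Because each update is a purely local computation yet is guaranteed not to raise the global energy, repeated sweeps must eventually stop changing any unit.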
Now we are going to consider another paradigm. Instead of using networks to find particular outputs for given inputs (e.g. associative memory recall or optimization problems), we want to use them to model the statistical behaviour of some part of our world. What this means is that we would like to show a network some distribution of patterns that comes from the real world and get it to build an internal model that is capable of generating that same distribution of patterns on its own. Such a model could be used to produce believable patterns if we need some more of them, or perhaps by examining the model we may gain some insight into the structure of the process which generated the original distribution. The more closely the probability distribution over patterns that the network generates matches the distribution in the real world, the happier we will be with our model.

Before we can go any further with this, I should introduce the idea of a network "generating" a distribution of patterns. The key new element is that now our rules of network operation must include some probabilistic steps. So now, if a network is in a given state when we perform an update on one neuron, the next state of the neuron will not be given deterministically. Such a network can never settle into a stable state, because units will always be changing state even if their inputs do not change. What good is that, you ask? Well, if we simply let such a network run freely for a long time and record the states it passes through, we can construct a probability distribution over those states. If we take those state vectors, or some subvectors of them, as our patterns, then we have solved the problem of how to make a network "generate" a probability distribution over a set of patterns.

To begin, consider a simple modification of the Hopfield net to use stochastic units. What did the original updating rule for the Hopfield net say? It simply told each unit to switch into whichever of its states made the total energy of the system lower [2]. Luckily, if the connections between units were all symmetric, then each unit could make this decision locally by simply computing the energy difference ΔE between it being inactive and active. For the i-th unit this was simply:

    ΔE_i = E(S_i = -1) - E(S_i = +1) = Σ_j w_ij S_j

If ΔE_i was negative then it was better (lower system energy) to be inactive; otherwise it was better to be active. This gave our updating rule directly, as in note [1].

We will now modify this updating rule to make it stochastic. Let each unit now set its new state to be active with probability

    p_i(+1) = 1 / (1 + e^(-ΔE_i / T))

where T is a parameter that describes the "temperature" of the network. This ties into statistical physics.
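This stochastic rule is easy to simulate. Below is a minimal sketch (the two-unit weight matrix is a made-up illustration): at T > 0 the network never settles, and the fraction of time it spends in each state defines a distribution in which lower-energy states are visited more often.

```python
import numpy as np

def p_active(dE, T):
    """Probability of becoming active: 1 / (1 + exp(-dE / T))."""
    return 1.0 / (1.0 + np.exp(-dE / T))

def stochastic_update(W, S, i, T, rng):
    """Set unit i to +1 with probability p_active, else -1.
    As T -> 0 this approaches a step function in dE, recovering
    the deterministic update rule."""
    dE = W[i] @ S          # energy gap: sum_j w_ij S_j (zero diagonal)
    S[i] = 1 if rng.random() < p_active(dE, T) else -1

# Illustrative two-unit net with one symmetric connection (made-up value).
W = np.array([[0.0, 1.0],
              [1.0, 0.0]])
rng = np.random.default_rng(0)
S = np.array([1, -1])

# Run the net freely and record the states it passes through.
n_steps = 20000
counts = {}
for t in range(n_steps):
    stochastic_update(W, S, t % 2, T=1.0, rng=rng)
    key = (int(S[0]), int(S[1]))
    counts[key] = counts.get(key, 0) + 1
freq = {k: c / n_steps for k, c in counts.items()}

# The aligned states (+1,+1) and (-1,-1) have lower energy, so they
# should dominate, but the "wrong" states are still visited sometimes.
assert (freq.get((1, 1), 0) + freq.get((-1, -1), 0)
        > freq.get((1, -1), 0) + freq.get((-1, 1), 0))
```

Note that every state has nonzero probability: the network keeps hopping between configurations forever, which is exactly what lets us read off a distribution from its trajectory.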
The units will usually go into the state which reduces the system energy, but they will sometimes go into the "wrong state", just as a physical system sometimes (but not often) visits higher energy states. At zero temperature, this update rule simply reduces to our old deterministic one.

If we let such a network run, we can generate a probability distribution over the states that it visits. We must be careful, however, to ensure that we measure this distribution only after the network has reached thermal equilibrium, which simply means that the averages of the quantities we will be measuring to characterize our distribution (for example, the average activation of the i-th unit) are not changing over time. How do we know if the network will ever reach such an equilibrium? Fortunately, it turns out that any network that always went downhill in some Lyapunov function in its deterministic version is guaranteed to reach a thermal equilibrium in its stochastic version. One important point to notice here is that the initial state of the network, which was so crucial in the deterministic version, is now unimportant: if we run the network for long enough, the equilibrium probability distribution over states will be the same no matter which state we began in.

In deterministic Hopfield nets, we chose the weights as sums of pattern correlations in order to make certain patterns stable points. But how do we set the weights in a stochastic network to make the distribution it generates match our world distribution? At thermal equilibrium, the probability of finding the network in any particular state depends only on the energy of that state, and is given by:

2 Boltzmann Machines

We saw in our discussion of stochastic networks that we could design networks which always moved from state to state, without settling down into a stable configuration, by making individual neurons behave probabilistically. If we measured the fraction of the time that these networks spent in each of their states once they reached thermal equilibrium, we could use such networks to generate probability distributions over the various states. We further saw that we could encourage this equilibrium distribution to be similar to our world distribution by changing the connection weights in the network. However, our prescription for the derivatives of the state probabilities with respect to the weights included only terms involving the activation states of pairs of units, S_i S_j. This means that such networks will never be able to capture any structure in our world probability distribution that is higher than second order. For example, we could never train such a network with three units to visit the states (0,0,0), (0,1,1), (1,0,1), and (1,1,0) with some probabilities but not the other four possible states. This is because the first and second order statistics (the mean of each component and the correlations between components) are the same for these four states as for the remaining four, and so the network cannot discriminate between them. This is a limitation of our stochastic networks, and indeed of all networks with no hidden units and only pairwise connections; it comes from the fact that the state vectors which we are using as our patterns involve every unit's activation. If we were to use the activations of only a certain subset of units as our patterns, then our networks would be able to capture higher order regularities in the distributions, because some of the units would be free to represent

[1] For example, the rule for operating our Hopfield nets in the high gain limit was simply: (1) Pick a unit. (2) Compute its net input as the sum of the connection weights to other active units. If this net input is positive, make the unit active (state = +1); if this net input is negative or zero, make the unit inactive (state = -1).

[2] Remember our Lyapunov function E = -(1/2) Σ_{j,i} w_ij S_i S_j, which was guaranteed to be reduced by each unit update? Well, this explains why: if unit k were to switch sign, it can compute the effect that would have on the global energy function by a purely local computation.
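The three-unit example in Section 2 can be verified numerically: the four target states and the four remaining states have identical first- and second-order statistics, so no pairwise-only network can tell them apart. A quick check (using 0/1 coding for the states, as in the text):

```python
import numpy as np

# States the text says a pairwise network cannot single out, and the rest.
target = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 0]])
others = np.array([[1, 1, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]])

# First-order statistics: the mean of each component is 1/2 in both sets.
assert np.allclose(target.mean(axis=0), others.mean(axis=0))

# Second-order statistics: average pairwise products s_i * s_j.
# Equal means plus equal second moments imply equal correlations.
corr_t = sum(np.outer(s, s) for s in target) / 4
corr_o = sum(np.outer(s, s) for s in others) / 4
assert np.allclose(corr_t, corr_o)
```

Since every statistic a pairwise network can express is matched across the two sets, any such network assigns them the same probabilities, which is exactly the limitation the text describes.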