Boltzmann Machines

Sam Roweis

1 Stochastic Networks

Up till this point we have been studying associative networks that operate in a deterministic way. That is, given particular interconnections between units, and a particular set of initial conditions, the networks would always exhibit the same dynamical behaviour, going downhill in energy, and hence always end up in the same stable state. This feature of their operation was due to the fact that the rules of operation of each neuron had no probabilistic elements to their specification [1]. This feature was useful when we were interested in pattern completion and other kinds of explicit computation for which we had some problem to be solved. The problem was encoded as the initial state of our network, our stored knowledge and constraints were encoded by the connections, and the solution was the final stable state into which the network settled.

Now we are going to consider another paradigm. Instead of using networks to find particular outputs for given inputs (e.g. associative memory recall or optimization problems), we want to use them to model the statistical behaviour of some part of our world. What this means is that we would like to show a network some distribution of patterns that comes from the real world and get it to build an internal model that is capable of generating that same distribution of patterns on its own. Such a model could be used to produce believable patterns if we need some more of them, or perhaps by examining the model we may gain some insight into the structure of the process which generated the original distribution. The more closely the probability distribution over patterns that the network generates matches the distribution in the real world, the happier we will be with our model.

Before we can go any further with this, I should introduce the idea of a network "generating" a distribution of patterns. The key new element is that now our rules of network operation must include some probabilistic steps. So now if a network is in a given state when we perform an update on one neuron, the next state of the neuron will not be given deterministically. Such a network can never settle into a stable state because units will always be changing state even if their inputs do not change. What good is that, you ask? Well, if we simply let such a network run freely for a long time and record the states it passes through, we can construct a probability distribution over those states. If we take those state vectors (or some subvectors of them) as our patterns, then we have solved the problem of how to make a network "generate" a probability distribution over a set of patterns.

To begin, consider a simple modification of the Hopfield net to use stochastic units. What did the original updating rule for the Hopfield net say? It simply told each unit to switch into whichever of its states made the total energy of the system lower [2]. Luckily, if the connections between units were all symmetric, then each unit could make this decision locally by simply computing the energy difference ΔE between it being inactive and active. For the i-th unit this was simply:

    ΔE_i = E_(S_i = −1) − E_(S_i = +1) = Σ_j w_ij S_j

If ΔE was negative then it was better (lower system energy) to be inactive, otherwise it was better to be active. This gave our updating rule directly (as in note 1).

We will now modify this updating rule to make it stochastic. Let each unit now set its new state to be active with probability:

    p_i(+1) = 1 / (1 + e^(−ΔE_i / T))

where T is a parameter that describes the "temperature" of the network. This ties into statistical physics: the units will usually go into the state which reduces the system energy, but they will sometimes go into the "wrong" state, just as a physical system sometimes (but not often) visits higher energy states. At zero temperature, this update rule just reduces to our old deterministic one.

[1] For example, the rule for operating our Hopfield nets in the high gain limit was simply: (1) Pick a unit. (2) Compute its net input as the sum of connection weights to other active units. If this net input is positive, make the unit active (state = +1); if this net input is negative or zero, make the unit inactive (state = −1).

[2] Remember our Lyapunov function E = −(1/2) Σ_{j,i} w_ij S_i S_j, which was guaranteed to be reduced by each unit update? Well, this explains why: if unit k were to switch sign, it can compute the effect that would have on the global energy function by a purely local computation.
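To make the stochastic update rule just described concrete, here is a minimal sketch (my own illustration, not code from these notes), assuming ±1 unit states, a symmetric weight matrix `w` with zero diagonal, and the convention ΔE_i = Σ_j w_ij S_j used above (any constant factor in the gap can be absorbed into the temperature `T`):

```python
import math
import random

def update_unit(s, w, i, T):
    """Stochastically update unit i of the state vector s (in place)."""
    # Energy gap for unit i, following the notes: Delta E_i = sum_j w_ij * S_j
    gap = sum(w[i][j] * s[j] for j in range(len(s)) if j != i)
    # Probability of becoming active: p_i(+1) = 1 / (1 + exp(-Delta E_i / T))
    p_on = 1.0 / (1.0 + math.exp(-gap / T))
    s[i] = 1 if random.random() < p_on else -1

def sweep(s, w, T):
    """One asynchronous pass over all units, in random order."""
    for i in random.sample(range(len(s)), len(s)):
        update_unit(s, w, i, T)

# e.g. a two-unit network with one positive connection, run at T = 1.0:
w = [[0.0, 1.0], [1.0, 0.0]]
s = [1, -1]
sweep(s, w, 1.0)
```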

If we let such a network run we can generate a probability distribution over the states that it visits. We must be careful, however, to ensure that we measure this distribution only after the network has reached thermal equilibrium, which simply means that the averages of the quantities we will be measuring to characterize our distribution (for example the average activation <S_i> of the i-th unit) are not changing over time. How do we know if the network will ever reach such an equilibrium? Fortunately it turns out to be true that any network that always went downhill in some Lyapunov function in its deterministic version is guaranteed to reach a thermal equilibrium in its stochastic version. One important point to notice here is that the initial state of the network (which was so crucial in the deterministic version) is now unimportant, because if we run the network for long enough, the equilibrium probability distribution over states will be the same no matter which state we began in.

In deterministic Hopfield nets, we choose the weights as the sums of pattern correlations in order to make certain patterns stable points. But how do we set the weights in a stochastic network to make the distribution it generates match our world distribution? At thermal equilibrium, the probability of finding the network in any particular state α depends only on the energy of that state, and is given by:

    P_α = e^(−E_α / T) / Σ_β e^(−E_β / T)

where P_α is the equilibrium probability of being in state α, and the sum in the denominator is over all possible states. From this and our original energy equation, we can compute how the weights change the probabilities of the states:

    ∂ ln P_α / ∂ w_ij = (1/T) ( S_i^α S_j^α − Σ_β P_β S_i^β S_j^β )

where S_k^α is the state of the k-th unit in state α. These derivatives can in principle be used to train the connection weights; however, as we will see when we introduce Boltzmann machines, there is a better way to find the weights in stochastic networks.
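These two formulas are easy to evaluate on a network small enough to enumerate. The sketch below is my own illustration (the weight matrix is an arbitrary example, and the energy convention E = −(1/2) Σ w_ij S_i S_j from note 2 is assumed); it computes the equilibrium distribution and the derivative of one state's log-probability with respect to one weight:

```python
import itertools
import math

def energy(s, w):
    # E = -1/2 * sum_{j,k} w_jk * S_j * S_k  (symmetric w, zero diagonal)
    n = len(s)
    return -0.5 * sum(w[j][k] * s[j] * s[k] for j in range(n) for k in range(n))

def equilibrium(w, T=1.0):
    """Enumerate all states and return them with their equilibrium probabilities."""
    states = list(itertools.product([-1, +1], repeat=len(w)))
    boltz = [math.exp(-energy(s, w) / T) for s in states]
    Z = sum(boltz)
    return states, [b / Z for b in boltz]

def dlogP_dw(states, P, alpha, i, j, T=1.0):
    # d ln P_alpha / d w_ij = (1/T) * (S_i^a S_j^a - sum_beta P_beta S_i^b S_j^b)
    mean_sisj = sum(p * s[i] * s[j] for s, p in zip(states, P))
    return (states[alpha][i] * states[alpha][j] - mean_sisj) / T

# Example: three units with a single positive weight between units 0 and 1.
w = [[0.0, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
states, P = equilibrium(w)
print(states[0], P[0], dlogP_dw(states, P, 0, 0, 1))
```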

2 Boltzmann Machines

We saw in our discussion of stochastic networks that we could design networks which always moved from state to state, without settling down into a stable configuration, by making individual neurons behave probabilistically. If we measured the fraction of the time that these networks spent in each of their states when they reached thermal equilibrium, we could use such networks to generate probability distributions over the various states. We further saw that we could encourage this equilibrium distribution to be similar to our world distribution by changing the connection weights in the network. However, our prescription for the derivatives of state probabilities with respect to the weights included only terms involving the activation states of pairs of units S_i S_j. This means that such networks will never be able to capture any structure in our world probability distribution that is higher than second order. For example, we could never train such a network with three units to visit the states (0,0,0), (0,1,1), (1,0,1), and (1,1,0) with some probabilities but not the other four possible states. This is because the first and second order statistics (the mean of each component and the correlations between components) are the same for these four states as for the remaining four, and so the network cannot discriminate between them.
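A quick way to convince yourself of this is to compute those statistics directly. The snippet below (my own check, not part of the notes) compares the two sets of four states, each taken with equal probability:

```python
import itertools

target = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]        # the four desired states
others = [s for s in itertools.product([0, 1], repeat=3) if s not in target]

def stats(states):
    """First order (means) and second order (pairwise products) statistics."""
    n = len(states)
    means = tuple(sum(s[i] for s in states) / n for i in range(3))
    pairs = tuple(sum(s[i] * s[j] for s in states) / n
                  for i in range(3) for j in range(i + 1, 3))
    return means, pairs

print(stats(target))   # ((0.5, 0.5, 0.5), (0.25, 0.25, 0.25))
print(stats(others))   # identical, so pairwise terms cannot tell the sets apart
```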

This is a limitation of our stochastic networks (and indeed of all networks with no hidden units and only pairwise connections), and it comes from the fact that the state vectors which we are using as our patterns involve every unit's activation. If we were to use the activations of only a certain subset of units as our patterns, then our networks would be able to capture higher order regularities in the distributions, because some of the units would be free to represent these regularities. Such a scheme turns out to work; the units involved in the patterns are then called visible units, and the others hidden units.

Boltzmann machines are essentially an extension of simple stochastic associative networks to include hidden units: units not involved in the pattern vector. Hidden units, however, introduce a new complication: knowing the world probability distribution of patterns which we want our network to reproduce now tells us only what the probability distribution of the states of the visible units should be. We do not know what the probability distribution of the hidden units should be, and hence we do not know the full probability distributions P_α of the entire network state which we needed to calculate our weight derivatives ∂ ln P_α / ∂ w_ij, so we cannot train our connections. We could make up the hidden unit distributions somehow, but in fact the whole idea is that we would like the network itself to discover how to use the hidden units to best represent the structure of our distribution of patterns, and so we do not want to have to specify their probabilities. Clearly the old Hopfield net rule of Δw_ij ∝ s_i s_j will not help us with the weights to hidden units, since we only have knowledge of s_i for visible units. How then will the connection weights get set? Boltzmann machines provide a learning algorithm which adapts all the connection weights in the network, given only the probability distribution over the visible units. Let us see how this works.

Consider a set of patterns and the real world probability distribution P+ over these patterns. For each component in these pattern vectors, we create a visible unit in the Boltzmann machine whose activity is associated with the value of that component. We also create some number of hidden units, which are not part of the patterns, that we hope will represent higher order regularities. All units in a Boltzmann machine compute an "energy gap":

    ΔE_i = E_(S_i = −1) − E_(S_i = +1) = Σ_j w_ij S_j

and then set their state according to the stochastic update rule:

    p_i(+1) = 1 / (1 + e^(−ΔE_i / T))

If we wait long enough, the system will reach a low temperature thermal equilibrium in which the probability of being in any global state depends only on its energy divided by the temperature [3]. We can estimate the probability distribution over the visible units in this "free-running" mode by sampling the average activities <S_i> of all the visible units. Call this measured distribution P−; we want it to be close to our desired distribution P+. We will use the Kullback-Leibler distance [4] between the distributions P+ and P− as a metric of how well our model is reflecting the world:

    G = G(P+ ‖ P−) = Σ_α P+_α ln ( P+_α / P−_α )

where the sum runs over the possible visible patterns α. We can think of G as being similar to energy functions we have used for our networks in the past in that we would like to minimize it [5]. As such, we would like to know how changing various weights will affect G. This brings us to the Boltzmann machine learning procedure.

[3] At high temperatures we reach equilibrium quickly, but low energy states are only slightly more likely than higher energy ones. At low temperatures, the probability of low energy states is significantly higher, but it takes forever to get to equilibrium. A strategy known as simulated annealing, which reduces the temperature as the network runs, is a fast way to achieve a low temperature equilibrium.

[4] This measure G(x ‖ y), also known as the relative entropy between two distributions, tells us the inefficiency of assuming that y is the distribution when the true distribution is x. In other words, if we knew x we could construct a code of average length H(x), but if we only know y then the best code we can build has average length H(x) + G(x ‖ y).

[5] Peter Brown (see PDP Chapter 7) has pointed out that minimizing G is equivalent to maximizing the log likelihood of generating the environmental probability distribution when the network is running freely and at equilibrium.
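As a tiny numerical illustration (my own, using made-up distributions rather than anything from the notes), here is G evaluated for a hypothetical four-pattern world distribution against a uniform free-running distribution:

```python
import math

def G(p_plus, p_minus):
    """Kullback-Leibler distance G = sum_a P+(a) * ln(P+(a) / P-(a))."""
    return sum(p * math.log(p / q) for p, q in zip(p_plus, p_minus) if p > 0)

world = [0.4, 0.4, 0.1, 0.1]       # hypothetical target distribution P+
model = [0.25, 0.25, 0.25, 0.25]   # a uniform free-running distribution P-
print(G(world, model))             # positive; shrinks toward 0 as the model improves
print(G(world, world))             # exactly 0.0 when the distributions match
```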

3 Learning

It turns out that all of the information about how a particular weight change alters G is available locally, if we are willing to be patient enough. The learning procedure for the Boltzmann machine has two phases. In Phase+, the visible units are clamped to the value of a particular pattern, and the network is allowed to reach low temperature thermal equilibrium. We then increment the weight between any two units which are both on. This is like Hebbian learning. This phase is repeated a large number of times, with each pattern being clamped with a frequency corresponding to the world probability P+ we would like to model. In Phase−, we let the network run freely (no units clamped) and sample the activities of all the units. Once we have reached (perhaps by annealing) a low temperature equilibrium, and not before, we take enough samples to obtain reliable averages of s_i s_j. Then we decrement the weight between any two units which are both on. This is called unlearning. If we alternate between the phases with approximately equal frequency (Phase+ should actually be run a little more often), then this learning procedure will on average reduce the cross-entropy between the network's free-running distribution and our target distribution [6]. It amounts to saying that:

    ∂G / ∂w_ij = −(1/T) ( <s_i s_j>+ − <s_i s_j>− )

where <s_i s_j>+ and <s_i s_j>− are the probabilities, at thermal equilibrium, of finding both the i-th and j-th units on together when the network is clamped and free-running respectively. For the visible units, <s_i s_j>+ is set by the target distribution of patterns that we are clamping, but for the hidden units, which are free in both phases, <s_i s_j>+ will be whatever representation of higher order regularity the network chooses.

[6] For a proof, see Appendix A of Chapter 7 in PDP.

The amazing thing about this rule is that it works for any pair of units, whether both visible, both hidden, or mixed. The equation makes intuitive sense for visible units: if <s_i s_j>+ is bigger than <s_i s_j>−, it means that units i and j are not on together in the free-running phase as often as they should be according to the clamped (target distribution) phase. So we would expect to want to increase w_ij, the connection between them; which is exactly what the equation says to do in order to reduce the energy G.
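Putting the two phases together, a compact sketch of the procedure might look like the following. This is my own illustrative reading of the algorithm, not reference code from the notes: unit states are taken as 0/1 here (so "both on" shows up directly as a product), `patterns` stands for a list of visible vectors drawn with the target frequencies P+ (repeats allowed), and the annealing schedule, sweep counts, and learning rate are arbitrary choices.

```python
import math
import random

def sweep(s, w, T, clamped):
    """One asynchronous pass over all unclamped units at temperature T."""
    for i in random.sample(range(len(s)), len(s)):
        if i in clamped:
            continue
        gap = sum(w[i][j] * s[j] for j in range(len(s)))        # energy gap Delta E_i
        s[i] = 1 if random.random() < 1.0 / (1.0 + math.exp(-gap / T)) else 0

def coactivations(w, n_units, clamp=None, anneal=(4.0, 2.0, 1.0, 0.5), samples=50):
    """Settle (by crude annealing) and then average s_i * s_j at the final temperature."""
    s = [random.randint(0, 1) for _ in range(n_units)]
    clamped = set()
    if clamp is not None:                       # Phase+: hold the visible units at a pattern
        for i, v in enumerate(clamp):
            s[i] = v
        clamped = set(range(len(clamp)))
    for T in anneal:                            # settle: a few sweeps at each temperature
        for _ in range(10):
            sweep(s, w, T, clamped)
    co = [[0.0] * n_units for _ in range(n_units)]
    for _ in range(samples):                    # sample co-activations at equilibrium
        sweep(s, w, anneal[-1], clamped)
        for i in range(n_units):
            for j in range(n_units):
                co[i][j] += s[i] * s[j] / samples
    return co

def learn(patterns, n_hidden, epochs=50, lr=0.05):
    """Two-phase learning: Hebbian increments when clamped, 'unlearning' when free."""
    n = len(patterns[0]) + n_hidden
    w = [[0.0] * n for _ in range(n)]
    for _ in range(epochs):
        plus = [[0.0] * n for _ in range(n)]
        for p in patterns:                      # Phase+: clamp each pattern in turn
            co = coactivations(w, n, clamp=p)
            for i in range(n):
                for j in range(n):
                    plus[i][j] += co[i][j] / len(patterns)
        minus = coactivations(w, n)             # Phase-: nothing clamped
        for i in range(n):
            for j in range(n):
                if i != j:                      # keep the diagonal at zero
                    w[i][j] += lr * (plus[i][j] - minus[i][j])
    return w

# e.g. the parity-like patterns from Section 2, with a couple of hidden units:
# w = learn([(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)], n_hidden=2)
```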

A crucial point that is often misunderstood is that the information about how a weight will change G is only available locally from the time evolution of the activities. Only when we settle to equilibrium and take many samples there does this information "grow" out of the noise in the system. It is impossible to tell, just from the activities of two units at one instant, how changing the weight between them will affect G.

Let us try to understand what is going on in the above algorithm. During Phase+, we are showing the machine what we want it to model and encouraging it to mimic our distribution by positive learning. During Phase−, we are attempting to eliminate accidental or spurious correlations between units by simply degrading all pairs that are active together, under the assumption that the correct correlations will be built up again during the positive phase.

At the highest level, we are simply doing gradient descent in the function G. Each time we run a phase we are able to compute a small change in a single weight which will on average reduce G. But each point in G space (weight space) is actually a whole energy landscape over all the machine's possible states. So each time we make an adjustment to the weights to reduce G, the energies of all states change. In effect, our goal in minimizing G is to deform the energy landscape so it matches our target energies over all the patterns. This is shown in the figure below:

[Figure: G plotted as a function of the weights. Each point in weight space corresponds to a whole energy landscape over the patterns; at the minimum of G the pattern energies match the 'target' pattern energies.]

For each step we wish to take on the G surface, we must run all of our patterns, both in the clamped and unclamped phases. Only then do we obtain the information for one weight update.

Now, nested inside this top level search is another search that we must perform each time we want to compute a step in G (weight) space. We have to settle to a low temperature thermal equilibrium in the "current" energy landscape in order to achieve a state in which low energy states occur much more frequently than high energy ones. To do this, we use simulated annealing or some other search technique, but this usually takes quite a bit of time. We can't skip this step, or else the probabilities P_α that we sample will not be representative of the current landscape's structure. Finally, nested within this settling is yet another time consuming process: once we are at thermal equilibrium we have to spend a large amount of time there sampling the activities of units in order to get a good estimate for <s_i s_j>+ − <s_i s_j>− [7].
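For the inner settling loop, one common choice (an assumption on my part; the notes do not prescribe a particular schedule) is geometric cooling, where the temperature is multiplied by a fixed factor after each batch of sweeps until it reaches the target low temperature at which sampling begins:

```python
def cooling_schedule(T_start=10.0, T_final=0.5, factor=0.8):
    """Yield a geometrically decreasing sequence of temperatures."""
    T = T_start
    while T > T_final:
        yield T            # run a batch of network sweeps at this temperature
        T *= factor

print([round(T, 2) for T in cooling_schedule()])
# [10.0, 8.0, 6.4, 5.12, 4.1, 3.28, 2.62, 2.1, 1.68, 1.34, 1.07, 0.86, 0.69, 0.55]
```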

4 Good and Bad Features

Boltzmann machines have been found to give excellent performance on many statistical decision tasks, greatly outstripping simple backprop networks [8]. Their ability to encode greater than second order regularities in data makes them extremely powerful. They also provide a very convenient Bayesian measure of how good a particular model or internal representation is: they simply ask, how likely is the model to generate the distribution of patterns from the world? In this sense they incorporate the maximum likelihood principle directly into their structure. However, they are excruciatingly slow. This is in part due to the many nested loops involved in the learning procedure as described above. But it is also largely due to the fact that Boltzmann machines represent probabilities directly: their units are actively turning on and off to represent a certain activity level, not simply holding a value which encodes that level of activation. There has been much research into an approximation of the Boltzmann machine dynamics known as the mean field approximation, which attempts to address these bottlenecks. It degrades performance slightly but is found to be much faster; however, that is a handout all to itself.

[7] Since we are taking the difference of two noisy random variables in order to estimate the correlation, the error only decreases as 1/√N for N samples.

[8] See in particular the study by Kohonen, Barna & Chrisley, "Statistical Pattern Recognition with Neural Networks" (1988), in IEEE ICNN, San Diego, volume I, pp. 61-68.