Boltzmann Machines

Sam Roweis

1 Stochastic Networks

Up till this point we have been studying associative networks that operate in a deterministic way. That is, given particular interconnections between units, and a particular set of initial conditions, the networks would always exhibit the same dynamical behaviour, going downhill in energy, and hence always end up in the same stable state. This feature of their operation was due to the fact that the rules of operation of each neuron had no probabilistic elements to their specification [1]. This feature was useful when we were interested in pattern completion and other kinds of explicit computation for which we had some problem to be solved. The problem was encoded as the initial state of our network, our stored knowledge and constraints were encoded by the connections, and the solution was the final stable state into which the network settled.

Now we are going to consider another paradigm. Instead of using networks to find particular outputs for given inputs (e.g. associative memory recall or optimization problems), we want to use them to model the statistical behaviour of some part of our world. What this means is that we would like to show a network some distribution of patterns that comes from the real world and get it to build an internal model that is capable of generating that same distribution of patterns on its own. Such a model could be used to produce believable patterns if we need some more of them, or perhaps by examining the model we may gain some insight into the structure of the process which generated the original distribution. The more closely the probability distribution over patterns that the network generates matches the distribution in the real world, the happier we will be with our model.

Before we can go any further with this, I should introduce the idea of a network "generating" a distribution of patterns. The key new element is that now our rules of network operation must include some probabilistic steps. So now if a network is in a given state when we perform an update on one neuron, the next state of the neuron will not be given deterministically. Such a network can never settle into a stable state because units will always be changing state even if their inputs do not change. What good is that, you ask? Well, if we simply let such a network run freely for a long time and record the states it passes through, we can construct a probability distribution over those states. If we take those state vectors (or some subvectors of them) as our patterns, then we have solved the problem of how to make a network "generate" a probability distribution over a set of patterns.

To begin, consider a simple modification of the Hopfield net to use stochastic units. What did the original updating rule for the Hopfield net say? It simply told each unit to switch into whichever of its states made the total energy of the system lower [2]. Luckily, if the connections between units were all symmetric, then each unit could make this decision locally by simply computing the energy difference ΔE between it being inactive and active. For the i-th unit this was simply:

    ΔE_i = E_(S_i = −1) − E_(S_i = +1) = Σ_j w_ij S_j

If ΔE was negative then it was better (lower system energy) to be inactive, otherwise it was better to be active. This gave our updating rule directly (as in note 1).

We will now modify this updating rule to make it stochastic. Let each unit now set its new state to be active with probability:

    p_i(+1) = 1 / (1 + e^(−ΔE_i / T))

where T is a parameter that describes the "temperature" of the network. This ties into statistical physics: the units will usually go into the state which reduces the system energy, but they will sometimes go into the "wrong" state, just as a physical system sometimes (but not often) visits higher energy states. At zero temperature, this update rule just reduces to our old deterministic one.

[1] For example, the rule for operating our Hopfield nets in the high gain limit was simply: (1) Pick a unit. (2) Compute its net input as the sum of connection weights to other active units. If this net input is positive, make the unit active (state = +1); if this net input is negative or zero, make the unit inactive (state = −1).

[2] Remember our Lyapunov function E = −(1/2) Σ_{j,i} w_ij S_i S_j, which was guaranteed to be reduced by each unit update? Well, this explains why: if unit k were to switch sign, it can compute the effect that would have on the global energy function by a purely local computation.
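To make the stochastic update rule just described concrete, here is a minimal sketch (my own illustration, not code from these notes), assuming ±1 unit states, a symmetric weight matrix `w` with zero diagonal, and the convention ΔE_i = Σ_j w_ij S_j used above (any constant factor in the gap can be absorbed into the temperature `T`):

```python
import math
import random

def update_unit(s, w, i, T):
    """Stochastically update unit i of the state vector s (in place)."""
    # Energy gap for unit i, following the notes: Delta E_i = sum_j w_ij * S_j
    gap = sum(w[i][j] * s[j] for j in range(len(s)) if j != i)
    # Probability of becoming active: p_i(+1) = 1 / (1 + exp(-Delta E_i / T))
    p_on = 1.0 / (1.0 + math.exp(-gap / T))
    s[i] = 1 if random.random() < p_on else -1

def sweep(s, w, T):
    """One asynchronous pass over all units, in random order."""
    for i in random.sample(range(len(s)), len(s)):
        update_unit(s, w, i, T)

# e.g. a two-unit network with one positive connection, run at T = 1.0:
w = [[0.0, 1.0], [1.0, 0.0]]
s = [1, -1]
sweep(s, w, 1.0)
```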

If we let such a network run we can generate a probability distribution over the states that it visits. We must be careful, however, to ensure that we measure this distribution only after the network has reached thermal equilibrium, which simply means that the averages of the quantities we will be measuring to characterize our distribution (for example the average activation <S_i> of the i-th unit) are not changing over time. How do we know if the network will ever reach such an equilibrium? Fortunately it turns out to be true that any network that always went downhill in some Lyapunov function in its deterministic version is guaranteed to reach a thermal equilibrium in its stochastic version. One important point to notice here is that the initial state of the network (which was so crucial in the deterministic version) is now unimportant, because if we run the network for long enough, the equilibrium probability distribution over states will be the same no matter which state we began in.

In deterministic Hopfield nets, we choose the weights as the sums of pattern correlations in order to make certain patterns stable points. But how do we set the weights in a stochastic network to make the distribution it generates match our world distribution? At thermal equilibrium, the probability of finding the network in any particular state α depends only on the energy of that state, and is given by:

    P_α = e^(−E_α / T) / Σ_β e^(−E_β / T)

where P_α is the equilibrium probability of being in state α, and the sum in the denominator is over all possible states. From this and our original energy equation, we can compute how the weights change the probabilities of the states:

    ∂ ln P_α / ∂ w_ij = (1/T) ( S_i^α S_j^α − Σ_β P_β S_i^β S_j^β )

where S_k^α is the state of the k-th unit in state α. These derivatives can in principle be used to train the connection weights; however, as we will see when we introduce Boltzmann machines, there is a better way to find the weights in stochastic networks.
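These two formulas are easy to evaluate on a network small enough to enumerate. The sketch below is my own illustration (the weight matrix is an arbitrary example, and the energy convention E = −(1/2) Σ w_ij S_i S_j from note 2 is assumed); it computes the equilibrium distribution and the derivative of one state's log-probability with respect to one weight:

```python
import itertools
import math

def energy(s, w):
    # E = -1/2 * sum_{j,k} w_jk * S_j * S_k  (symmetric w, zero diagonal)
    n = len(s)
    return -0.5 * sum(w[j][k] * s[j] * s[k] for j in range(n) for k in range(n))

def equilibrium(w, T=1.0):
    """Enumerate all states and return them with their equilibrium probabilities."""
    states = list(itertools.product([-1, +1], repeat=len(w)))
    boltz = [math.exp(-energy(s, w) / T) for s in states]
    Z = sum(boltz)
    return states, [b / Z for b in boltz]

def dlogP_dw(states, P, alpha, i, j, T=1.0):
    # d ln P_alpha / d w_ij = (1/T) * (S_i^a S_j^a - sum_beta P_beta S_i^b S_j^b)
    mean_sisj = sum(p * s[i] * s[j] for s, p in zip(states, P))
    return (states[alpha][i] * states[alpha][j] - mean_sisj) / T

# Example: three units with a single positive weight between units 0 and 1.
w = [[0.0, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
states, P = equilibrium(w)
print(states[0], P[0], dlogP_dw(states, P, 0, 0, 1))
```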

2 Boltzmann Machines

We saw in our discussion of stochastic networks that we could design networks which always moved from state to state, without settling down into a stable configuration, by making individual neurons behave probabilistically. If we measured the fraction of the time that these networks spent in each of their states when they reached thermal equilibrium, we could use such networks to generate probability distributions over the various states. We further saw that we could encourage this equilibrium distribution to be similar to our world distribution by changing the connection weights in the network. However, our prescription for the derivatives of state probabilities with respect to the weights included only terms involving the activation states of pairs of units S_i S_j. This means that such networks will never be able to capture any structure in our world probability distribution that is higher than second order. For example, we could never train such a network with three units to visit the states (0,0,0), (0,1,1), (1,0,1), and (1,1,0) with some probabilities but not the other four possible states. This is because the first and second order statistics (the mean of each component and the correlations between components) are the same for these four states as for the remaining four, and so the network cannot discriminate between them.
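A quick way to convince yourself of this is to compute those statistics directly. The snippet below (my own check, not part of the notes) compares the two sets of four states, each taken with equal probability:

```python
import itertools

target = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]        # the four desired states
others = [s for s in itertools.product([0, 1], repeat=3) if s not in target]

def stats(states):
    """First order (means) and second order (pairwise products) statistics."""
    n = len(states)
    means = tuple(sum(s[i] for s in states) / n for i in range(3))
    pairs = tuple(sum(s[i] * s[j] for s in states) / n
                  for i in range(3) for j in range(i + 1, 3))
    return means, pairs

print(stats(target))   # ((0.5, 0.5, 0.5), (0.25, 0.25, 0.25))
print(stats(others))   # identical, so pairwise terms cannot tell the sets apart
```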

This is a limitation of our stochastic networks (and indeed of all networks with no hidden units and only pairwise connections), and it comes from the fact that the state vectors which we are using as our patterns involve every unit's activation. If we were to use the activations of only a certain subset of units as our patterns, then our networks would be able to capture higher order regularities in the distributions, because some of the units would be free to represent these regularities. Such a scheme turns out to work; the units involved in the patterns are then called visible units, and the others hidden units.

Boltzmann machines are essentially an extension of simple stochastic associative networks to include hidden units: units not involved in the pattern vector. Hidden units, however, introduce a new complication: knowing the world probability distribution of patterns which we want our network to reproduce now tells us only what the probability distribution of the states of the visible units should be. We do not know what the probability distribution of the hidden units should be, and hence we do not know the full probability distributions P_α of the entire network state which we needed to calculate our weight derivatives ∂ ln P_α / ∂ w_ij, so we cannot train our connections. We could make up the hidden unit distributions somehow, but in fact the whole idea is that we would like the network itself to discover how to use the hidden units to best represent the structure of our distribution of patterns, and so we do not want to have to specify their probabilities. Clearly the old Hopfield net rule of Δw_ij ∝ s_i s_j will not help us with the weights to hidden units, since we only have knowledge of s_i for visible units. How then will the connection weights get set? Boltzmann machines provide a learning algorithm which adapts all the connection weights in the network, given only the probability distribution over the visible units. Let us see how this works.

Consider a set of patterns and the real world probability distribution P+ over these patterns. For each component in these pattern vectors, we create a visible unit in the Boltzmann machine whose activity is associated with the value of that component. We also create some number of hidden units, which are not part of the patterns, that we hope will represent higher order regularities. All units in a Boltzmann machine compute an "energy gap":

    ΔE_i = E_(S_i = −1) − E_(S_i = +1) = Σ_j w_ij S_j

and then set their state according to the stochastic update rule:

    p_i(+1) = 1 / (1 + e^(−ΔE_i / T))

If we wait long enough, the system will reach a low temperature thermal equilibrium in which the probability of being in any global state depends only on its energy divided by the temperature [3]. We can estimate the probability distribution over the visible units in this "free-running" mode by sampling the average activities <S_i> of all the visible units. Call this measured distribution P−; we want it to be close to our desired distribution P+. We will use the Kullback-Leibler distance [4] between the distributions P+ and P− as a metric of how well our model is reflecting the world:

    G = G(P+ ‖ P−) = Σ_α P+_α ln ( P+_α / P−_α )

where the sum runs over the possible visible patterns α. We can think of G as being similar to energy functions we have used for our networks in the past in that we would like to minimize it [5]. As such, we would like to know how changing various weights will affect G. This brings us to the Boltzmann machine learning procedure.

[3] At high temperatures we reach equilibrium quickly, but low energy states are only slightly more likely than higher energy ones. At low temperatures, the probability of low energy states is significantly higher, but it takes forever to get to equilibrium. A strategy known as simulated annealing, which reduces the temperature as the network runs, is a fast way to achieve a low temperature equilibrium.

[4] This measure G(x ‖ y), also known as the relative entropy between two distributions, tells us the inefficiency of assuming that y is the distribution when the true distribution is x. In other words, if we knew x we could construct a code of average length H(x), but if we only know y then the best code we can build has average length H(x) + G(x ‖ y).

[5] Peter Brown (see PDP Chapter 7) has pointed out that minimizing G is equivalent to maximizing the log likelihood of generating the environmental probability distribution when the network is running freely and at equilibrium.
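As a tiny numerical illustration (my own, using made-up distributions rather than anything from the notes), here is G evaluated for a hypothetical four-pattern world distribution against a uniform free-running distribution:

```python
import math

def G(p_plus, p_minus):
    """Kullback-Leibler distance G = sum_a P+(a) * ln(P+(a) / P-(a))."""
    return sum(p * math.log(p / q) for p, q in zip(p_plus, p_minus) if p > 0)

world = [0.4, 0.4, 0.1, 0.1]       # hypothetical target distribution P+
model = [0.25, 0.25, 0.25, 0.25]   # a uniform free-running distribution P-
print(G(world, model))             # positive; shrinks toward 0 as the model improves
print(G(world, world))             # exactly 0.0 when the distributions match
```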

3 Learning

It turns out that all of the information about how a particular weight change alters G is available locally, if we are willing to be patient enough. The learning procedure for the Boltzmann machine has two phases. In Phase+, the visible units are clamped to the value of a particular pattern, and the network is allowed to reach low temperature thermal equilibrium. We then increment the weight between any two units which are both on. This is like Hebbian learning. This phase is repeated a large number of times, with each pattern being clamped with a frequency corresponding to the world probability P+ we would like to model. In Phase−, we let the network run freely (no units clamped) and sample the activities of all the units. Once we have reached (perhaps by annealing) a low temperature equilibrium, and not before, we take enough samples to obtain reliable averages of s_i s_j. Then we decrement the weight between any two units which are both on. This is called unlearning. If we alternate between the phases with approximately equal frequency (Phase+ should actually be run a little more often), then this learning procedure will on average reduce the cross-entropy between the network's free-running distribution and our target distribution [6]. It amounts to saying that:

    ∂G / ∂w_ij = −(1/T) ( <s_i s_j>+ − <s_i s_j>− )

where <s_i s_j>+ and <s_i s_j>− are the probabilities, at thermal equilibrium, of finding both the i-th and j-th units on together when the network is clamped and free-running respectively. For the visible units, <s_i s_j>+ is set by the target distribution of patterns that we are clamping, but for the hidden units, which are free in both phases, <s_i s_j>+ will be whatever representation of higher order regularity the network chooses.

[6] For a proof, see Appendix A of Chapter 7 in PDP.

The amazing thing about this rule is that it works for any pair of units, whether both visible, both hidden, or mixed. The equation makes intuitive sense for visible units: if <s_i s_j>+ is bigger than <s_i s_j>−, it means that units i and j are not on together in the free-running phase as often as they should be according to the clamped (target distribution) phase. So we would expect to want to increase w_ij, the connection between them; which is exactly what the equation says to do in order to reduce the energy G.
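Putting the two phases together, a compact sketch of the procedure might look like the following. This is my own illustrative reading of the algorithm, not reference code from the notes: unit states are taken as 0/1 here (so "both on" shows up directly as a product), `patterns` stands for a list of visible vectors drawn with the target frequencies P+ (repeats allowed), and the annealing schedule, sweep counts, and learning rate are arbitrary choices.

```python
import math
import random

def sweep(s, w, T, clamped):
    """One asynchronous pass over all unclamped units at temperature T."""
    for i in random.sample(range(len(s)), len(s)):
        if i in clamped:
            continue
        gap = sum(w[i][j] * s[j] for j in range(len(s)))        # energy gap Delta E_i
        s[i] = 1 if random.random() < 1.0 / (1.0 + math.exp(-gap / T)) else 0

def coactivations(w, n_units, clamp=None, anneal=(4.0, 2.0, 1.0, 0.5), samples=50):
    """Settle (by crude annealing) and then average s_i * s_j at the final temperature."""
    s = [random.randint(0, 1) for _ in range(n_units)]
    clamped = set()
    if clamp is not None:                       # Phase+: hold the visible units at a pattern
        for i, v in enumerate(clamp):
            s[i] = v
        clamped = set(range(len(clamp)))
    for T in anneal:                            # settle: a few sweeps at each temperature
        for _ in range(10):
            sweep(s, w, T, clamped)
    co = [[0.0] * n_units for _ in range(n_units)]
    for _ in range(samples):                    # sample co-activations at equilibrium
        sweep(s, w, anneal[-1], clamped)
        for i in range(n_units):
            for j in range(n_units):
                co[i][j] += s[i] * s[j] / samples
    return co

def learn(patterns, n_hidden, epochs=50, lr=0.05):
    """Two-phase learning: Hebbian increments when clamped, 'unlearning' when free."""
    n = len(patterns[0]) + n_hidden
    w = [[0.0] * n for _ in range(n)]
    for _ in range(epochs):
        plus = [[0.0] * n for _ in range(n)]
        for p in patterns:                      # Phase+: clamp each pattern in turn
            co = coactivations(w, n, clamp=p)
            for i in range(n):
                for j in range(n):
                    plus[i][j] += co[i][j] / len(patterns)
        minus = coactivations(w, n)             # Phase-: nothing clamped
        for i in range(n):
            for j in range(n):
                if i != j:                      # keep the diagonal at zero
                    w[i][j] += lr * (plus[i][j] - minus[i][j])
    return w

# e.g. the parity-like patterns from Section 2, with a couple of hidden units:
# w = learn([(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)], n_hidden=2)
```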

A crucial point that is often misunderstood is that the information about how a weight will change G is only available locally from the time evolution of the activities. Only when we settle to equilibrium and take many samples there does this information "grow" out of the noise in the system. It is impossible to tell, just from the activities of two units at one instant, how changing the weight between them will affect G.

Let us try to understand what is going on in the above algorithm. During Phase+, we are showing the machine what we want it to model and encouraging it to mimic our distribution by positive learning. During Phase−, we are attempting to eliminate accidental or spurious correlations between units by simply degrading all pairs that are active together, under the assumption that the correct correlations will be built up again during the positive phase.

At the highest level, we are simply doing gradient descent in the function G. Each time we run a phase we are able to compute a small change in a single weight which will on average reduce G. But each point in G space (weight space) is actually a whole energy landscape over all the machine's possible states. So each time we make an adjustment to the weights to reduce G, the energies of all states change. In effect, our goal in minimizing G is to deform the energy landscape so it matches our target energies over all the patterns. This is shown in the figure below:

[Figure: G plotted as a function of the weights. Each point in weight space corresponds to a whole energy landscape over the patterns; at the minimum of G the pattern energies match the 'target' pattern energies.]

For each step we wish to take on the G surface, we must run all of our patterns, both in the clamped and unclamped phases. Only then do we obtain the information for one weight update.

Now, nested inside this top level search is another search that we must perform each time we want to compute a step in G (weight) space. We have to settle to a low temperature thermal equilibrium in the "current" energy landscape in order to achieve a state in which low energy states occur much more frequently than high energy ones. To do this, we use simulated annealing or some other search technique, but this usually takes quite a bit of time. We can't skip this step, or else the probabilities P_α that we sample will not be representative of the current landscape's structure. Finally, nested within this settling is yet another time consuming process: once we are at thermal equilibrium we have to spend a large amount of time there sampling the activities of units in order to get a good estimate for <s_i s_j>+ − <s_i s_j>− [7].
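For the inner settling loop, one common choice (an assumption on my part; the notes do not prescribe a particular schedule) is geometric cooling, where the temperature is multiplied by a fixed factor after each batch of sweeps until it reaches the target low temperature at which sampling begins:

```python
def cooling_schedule(T_start=10.0, T_final=0.5, factor=0.8):
    """Yield a geometrically decreasing sequence of temperatures."""
    T = T_start
    while T > T_final:
        yield T            # run a batch of network sweeps at this temperature
        T *= factor

print([round(T, 2) for T in cooling_schedule()])
# [10.0, 8.0, 6.4, 5.12, 4.1, 3.28, 2.62, 2.1, 1.68, 1.34, 1.07, 0.86, 0.69, 0.55]
```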

4 Good and Bad Features

Boltzmann machines have been found to give excellent performance on many statistical decision tasks, greatly outstripping simple backprop networks [8]. Their ability to encode greater than second order regularities in data makes them extremely powerful. They also provide a very convenient Bayesian measure of how good a particular model or internal representation is: they simply ask, how likely is the model to generate the distribution of patterns from the world? In this sense they incorporate the maximum likelihood principle directly into their structure. However, they are excruciatingly slow. This is in part due to the many nested loops involved in the learning procedure as described above. But it is also largely due to the fact that Boltzmann machines represent probabilities directly: their units are actively turning on and off to represent a certain activity level, not simply holding a value which encodes that level of activation. There has been much research into an approximation of the Boltzmann machine dynamics known as the mean field approximation, which attempts to address these bottlenecks. It degrades performance slightly but is found to be much faster; however, that is a handout all to itself.

[7] Since we are taking the difference of two noisy random variables in order to estimate the correlation, the error only decreases as 1/√N for N samples.

[8] See in particular the study by Kohonen, Barna & Chrisley, "Statistical Pattern Recognition with Neural Networks" (1988), in IEEE ICNN, San Diego, volume I, pp. 61-68.