12. Restricted Boltzmann machines

Restricted Boltzmann machines

Learning RBMs

Deep Boltzmann machines

Learning DBMs

Unsupervised learning

Given i.i.d. samples $\{v^{(1)}, \dots, v^{(n)}\}$, learn the joint distribution $\mu(v)$

Restricted Boltzmann machines

Motivation for studying RBMs:
- concrete example of parameter learning
- applications in dimensionality reduction, classification, collaborative filtering, topic modeling
- successful in vision, language, and speech
- unsupervised learning: learn a representation or a distribution

Running example: learn a distribution over hand-written digits (cf. character recognition)
- $32 \times 32$ binary images $v^{(\ell)} \in \{0,1\}^{32 \times 32}$


Grid model:
$$\mu(v) = \frac{1}{Z} \exp\Big\{ \sum_i \theta_i v_i + \sum_{(i,j) \in E} \theta_{ij} v_i v_j \Big\}$$

(figure: a sample $v^{(\ell)}$)

Restricted Boltzmann machine [Smolensky 1986], "harmoniums"

- undirected graphical model
- two layers: visible layer $v \in \{0,1\}^N$ and hidden layer $h \in \{0,1\}^M$
- fully connected bipartite graph

- RBMs are represented by parameters $a \in \mathbb{R}^N$, $b \in \mathbb{R}^M$, and $W \in \mathbb{R}^{N \times M}$ such that
$$\mu(v, h) = \frac{1}{Z} \exp\big\{ a^T v + b^T h + v^T W h \big\}$$

- considering the marginal $\mu(v)$ of the visible nodes, any distribution on $v$ can be modeled arbitrarily well by an RBM with $k - 1$ hidden nodes, where $k$ is the cardinality of the support of the target distribution

Restricted Boltzmann machine

define the energy $E(v, h) = -a^T v - b^T h - v^T W h$, so that

$$\mu(v, h) = \frac{1}{Z} \exp\{-E(v, h)\} = \frac{1}{Z} \prod_{i \in [N]} e^{a_i v_i} \prod_{j \in [M]} e^{b_j h_j} \prod_{(i,j) \in E} e^{v_i W_{ij} h_j}$$

note that an RBM has no intra-layer edges (hence the name restricted); cf. the Boltzmann machine [Hinton, Sejnowski 1985]
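To make this definition concrete, here is a minimal NumPy sketch (the function names and toy sizes are my own, not from the lecture) that evaluates the energy $E(v,h)$ and the corresponding unnormalized probability $\exp\{-E(v,h)\}$ of a configuration:

```python
import numpy as np

def rbm_energy(v, h, a, b, W):
    """Energy E(v, h) = -a^T v - b^T h - v^T W h."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

def unnormalized_prob(v, h, a, b, W):
    """exp{-E(v, h)}; dividing by the (intractable) partition function Z gives mu(v, h)."""
    return np.exp(-rbm_energy(v, h, a, b, W))

# toy example with N = 4 visible and M = 3 hidden units
rng = np.random.default_rng(0)
N, M = 4, 3
a, b, W = rng.normal(size=N), rng.normal(size=M), rng.normal(size=(N, M))
v = rng.integers(0, 2, size=N).astype(float)
h = rng.integers(0, 2, size=M).astype(float)
print(rbm_energy(v, h, a, b, W), unnormalized_prob(v, h, a, b, W))
```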

Restricted Boltzmann machine

it follows from the Markov property of MRFs that the conditional distributions are product distributions:
$$\mu(v|h) \propto \exp\big\{ (a + Wh)^T v \big\} = \prod_i \exp\big\{ (a_i + \langle W_{i\cdot}, h \rangle) v_i \big\} = \prod_{i \in [N]} \mu(v_i|h)$$
$$\mu(h|v) = \prod_{j \in [M]} \mu(h_j|v)$$

hence, inference in the conditional distribution is efficient, e.g. computing a conditional marginal $\mu(h_j|v)$: since $\mu(h|v) = \prod_j \mu(h_j|v)$,

$$\mu(h_j = 0|v) = \frac{\mu(h_j = 0|v)}{\mu(h_j = 0|v) + \mu(h_j = 1|v)} = \frac{1}{1 + e^{b_j + \langle W_{\cdot j}, v \rangle}}$$
$$\mu(h_j = 1|v) = \frac{1}{1 + e^{-(b_j + \langle W_{\cdot j}, v \rangle)}} \triangleq \sigma\big(b_j + \langle W_{\cdot j}, v \rangle\big)$$
where $\sigma(\cdot)$ is the sigmoid function; similarly, $\mu(v_i = 1|h) = \sigma\big(a_i + \langle W_{i\cdot}, h \rangle\big)$,

where $W_{i\cdot}$ and $W_{\cdot j}$ are the $i$-th row and $j$-th column of $W$ respectively, and $\langle \cdot, \cdot \rangle$ denotes the inner product
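As a sanity check on these formulas, a small NumPy sketch (function names are illustrative) that computes all conditional marginals at once, $\mu(h_j = 1|v) = \sigma(b_j + \langle W_{\cdot j}, v \rangle)$ and $\mu(v_i = 1|h) = \sigma(a_i + \langle W_{i\cdot}, h \rangle)$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_conditionals(v, b, W):
    """Vector of mu(h_j = 1 | v) = sigmoid(b_j + <W_{.j}, v>) for j = 1, ..., M."""
    return sigmoid(b + W.T @ v)

def visible_conditionals(h, a, W):
    """Vector of mu(v_i = 1 | h) = sigmoid(a_i + <W_{i.}, h>) for i = 1, ..., N."""
    return sigmoid(a + W @ h)
```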

- one interpretation of an RBM is as a stochastic version of a neural network, where nodes and edges correspond to neurons and synaptic connections, and a single node fires (i.e. $v_i = +1$) stochastically according to a sigmoid activation function $\sigma(x) = 1/(1 + e^{-x})$:

$$\mu(v_i = +1|h) = \frac{e^{a_i + \langle W_{i\cdot}, h \rangle}}{1 + e^{a_i + \langle W_{i\cdot}, h \rangle}} = \sigma\big(a_i + \langle W_{i\cdot}, h \rangle\big)$$

- another interpretation is that the bias $a$ and the weight column $W_{\cdot j}$ connected to a hidden node $h_j$ encode higher-level structure
(figure: the visible bias $a$ plus columns of $W$, combined according to a binary hidden vector $h$)

- if the goal is to sample from this distribution (perhaps to generate images that resemble hand-written digits), the lack of intra-layer connections makes sampling particularly easy, since one can apply block Gibbs sampling, updating each layer all at once by sampling from $\mu(v|h)$ and $\mu(h|v)$ iteratively (a sketch follows below)
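A minimal block Gibbs sampler along these lines; the number of sweeps and the random initialization are assumptions of this sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def block_gibbs_sample(a, b, W, n_steps=1000, rng=None):
    """Alternately sample h ~ mu(h|v) and v ~ mu(v|h); each layer is resampled all at once."""
    rng = rng or np.random.default_rng()
    N, M = W.shape
    v = rng.integers(0, 2, size=N).astype(float)                  # arbitrary starting point
    for _ in range(n_steps):
        h = (rng.random(M) < sigmoid(b + W.T @ v)).astype(float)  # block update of the hidden layer
        v = (rng.random(N) < sigmoid(a + W @ h)).astype(float)    # block update of the visible layer
    return v, h
```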

What about computing the marginal $\mu(v)$?

$$\mu(v) = \sum_{h \in \{0,1\}^M} \mu(v, h) = \frac{1}{Z} \sum_h \exp\{-E(v, h)\} = \frac{1}{Z} \exp\Big\{ a^T v + \sum_{j=1}^M \log\big(1 + e^{b_j + \langle W_{\cdot j}, v \rangle}\big) \Big\}$$
because
$$\sum_{h \in \{0,1\}^M} e^{a^T v + b^T h + v^T W h} = e^{a^T v} \sum_{h \in \{0,1\}^M} e^{b^T h + v^T W h} = e^{a^T v} \sum_{h \in \{0,1\}^M} \prod_{j=1}^M e^{b_j h_j + \langle v, W_{\cdot j} \rangle h_j} = e^{a^T v} \prod_{j=1}^M \sum_{h_j \in \{0,1\}} e^{b_j h_j + \langle v, W_{\cdot j} \rangle h_j} = \cdots$$

so
$$\mu(v) = \frac{1}{Z} \exp\Big\{ a^T v + \sum_{j=1}^M \log\big(1 + e^{b_j + \langle W_{\cdot j}, v \rangle}\big) \Big\} = \frac{1}{Z} \exp\Big\{ a^T v + \sum_{j=1}^M \mathrm{softplus}\big(b_j + \langle W_{\cdot j}, v \rangle\big) \Big\}$$
where $\mathrm{softplus}(x) = \log(1 + e^x)$ acts as a soft maximum of $0$ and $x$; $a_i$ is the bias on $v_i$, $b_j$ is the bias on $h_j$, and $W_{\cdot j}$ is the output "image" $v$ associated with the hidden node $h_j$
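The quantity inside the exponent (sometimes called the negative free energy) is therefore cheap to evaluate even though $Z$ is not; a sketch, using `np.logaddexp(0, x)` as a numerically stable $\log(1 + e^x)$:

```python
import numpy as np

def log_unnormalized_marginal(v, a, b, W):
    """log(Z * mu(v)) = a^T v + sum_j softplus(b_j + <W_{.j}, v>)."""
    return a @ v + np.sum(np.logaddexp(0.0, b + W.T @ v))
```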

in the spirit of unsupervised learning, how do we extract features so as to achieve, for example, dimensionality reduction?

- suppose we have learned the RBM, and have $a$, $b$, $W$ at our disposal
- we want to compute features of a new sample $v^{(\ell)}$
- we use the conditional distribution of the hidden layer as features (see the sketch after the figure below):

$$\mu(h|v^{(\ell)}) = \big[ \mu(h_1|v^{(\ell)}), \cdots, \mu(h_M|v^{(\ell)}) \big]$$

(figure: a sample $v^{(\ell)}$)

(figure: extracted features $[0, .2, 0, .3, 0, 0, 0, 0, 0, .8, 0, 0, 0, .4, .1, 0]$)
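Feature extraction thus amounts to one matrix-vector product followed by a sigmoid; a sketch under the same conventions as the earlier snippets:

```python
import numpy as np

def extract_features(v, b, W):
    """Features [mu(h_1 = 1|v), ..., mu(h_M = 1|v)] = sigmoid(b + W^T v)."""
    return 1.0 / (1.0 + np.exp(-(b + W.T @ v)))
```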

Parameter learning

- given i.i.d. samples $v^{(1)}, \dots, v^{(n)}$ from the marginal $\mu(v)$, we want to learn the RBM using the maximum likelihood estimator
$$\mu(v) = \frac{1}{Z} \sum_{h \in \{0,1\}^M} \exp\big\{ a^T v + b^T h + v^T W h \big\}$$
$$\mathcal{L}(a, b, W) = \frac{1}{n} \sum_{\ell=1}^n \log \mu(v^{(\ell)})$$

- gradient ascent (although not concave, and multi-modal); consider a single-sample likelihood
$$\log \mu(v^{(\ell)}) = \log \sum_h \exp\big( a^T v^{(\ell)} + b^T h + (v^{(\ell)})^T W h \big) - \log Z$$
$$\frac{\partial \log \mu(v^{(\ell)})}{\partial b_i} = \frac{\sum_h h_i \, e^{a^T v^{(\ell)} + b^T h + (v^{(\ell)})^T W h}}{\sum_h e^{a^T v^{(\ell)} + b^T h + (v^{(\ell)})^T W h}} - \frac{\sum_{v,h} h_i \, e^{a^T v + b^T h + v^T W h}}{Z} = \underbrace{\mathbb{E}_{\mu(h|v^{(\ell)})}[h_i]}_{\text{data-dependent expectation}} - \underbrace{\mathbb{E}_{\mu(v,h)}[h_i]}_{\text{model expectation}}$$
$$\frac{\partial \log \mu(v^{(\ell)})}{\partial a_i} = v_i^{(\ell)} - \mathbb{E}_{\mu(v,h)}[v_i]$$
$$\frac{\partial \log \mu(v^{(\ell)})}{\partial W_{ij}} = \mathbb{E}_{\mu(h|v^{(\ell)})}[v_i h_j] - \mathbb{E}_{\mu(v,h)}[v_i h_j]$$

  - the first term is easy, since the conditional distribution factorizes
- once we have the gradient, update $a$, $b$, and $W$ as
$$W_{ij}^{(t+1)} = W_{ij}^{(t)} + \epsilon_t \underbrace{\big( \mathbb{E}_{\mu(h|v) q(v)}[v_i h_j] - \mathbb{E}_{\mu(v,h)}[v_i h_j] \big)}_{\partial \log \mu / \partial W_{ij}}$$
where $q(v)$ is the empirical distribution of the training samples $v^{(1)}, \dots, v^{(n)}$
- to compute the gradient, we need the second term $\mathbb{E}_{\mu(v,h)}[v_i h_j]$, which requires inference
  - one approach is to approximately compute the expectation using samples from MCMC, but in general this can be quite slow for a large network
  - custom algorithms designed for RBMs: contrastive divergence, persistent contrastive divergence, parallel tempering
- MCMC (block Gibbs sampling) for computing $\mathbb{E}_{\mu(v,h)}[v_i h_j]$, where $\mu$ is defined by the current parameters $a^{(t)}, b^{(t)}, W^{(t)}$:

  - start with samples $\{v_k^{(0)}\}_{k \in [K]}$ drawn from the data (i.i.d. uniformly at random with replacement) and repeat:
  - sample $\{h_k^{(t)}\}_{k \in [K]}$ with $\mu(h_k^{(t)} | v_k^{(t)})$, and
  - sample $\{v_k^{(t+1)}\}_{k \in [K]}$ with $\mu(v_k^{(t+1)} | h_k^{(t)})$

practical parameter learning algorithms for RBMs use heuristics to approximate the log-likelihood gradient; to speed up the Markov chain, use persistent Markov chains
- jointly update fantasy particles $\{(h, v)_k^{(t)}\}_{k \in [K]}$ and parameters $(a^{(t)}, b^{(t)}, W^{(t)})$ by repeating:
  - fix $(a^{(t)}, b^{(t)}, W^{(t)})$ and sample the next fantasy particles $\{(h, v)_k^{(t+1)}\}_{k \in [K]}$ according to a Markov chain
  - update $(a^{(t)}, b^{(t)}, W^{(t)})$ via
$$W_{ij}^{(t+1)} = W_{ij}^{(t)} + \epsilon_t \big( \mathbb{E}_{\mu(h|v) q(v)}[v_i h_j] - \mathbb{E}_{\mu(v,h)}[v_i h_j] \big)$$
computing the model expectation (the second term) from $\{(h, v)_k^{(t+1)}\}_{k \in [K]}$ (a sketch follows below)
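A sketch of one such persistent update in NumPy (the minibatch interface, learning rate, and function names are my assumptions, not the lecture's notation); the data-dependent term marginalizes $h$ analytically, while the model term uses the fantasy particles:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def persistent_cd_update(V_data, V_fantasy, a, b, W, lr=0.01, rng=None):
    """One persistent update: V_data is a (K, N) minibatch defining q(v);
    V_fantasy is a (K, N) array of fantasy particles carried across updates."""
    rng = rng or np.random.default_rng()
    K = V_data.shape[0]

    # data-dependent expectation E_{mu(h|v) q(v)}[v_i h_j], with h marginalized analytically
    P_data = sigmoid(V_data @ W + b)                       # (K, M): mu(h_j = 1 | v) per sample
    pos_W = V_data.T @ P_data / K

    # advance the persistent chain by one block Gibbs step, then form the model expectation
    H_f = (rng.random(P_data.shape) < sigmoid(V_fantasy @ W + b)).astype(float)
    V_fantasy = (rng.random(V_data.shape) < sigmoid(H_f @ W.T + a)).astype(float)
    P_f = sigmoid(V_fantasy @ W + b)
    neg_W = V_fantasy.T @ P_f / K

    W = W + lr * (pos_W - neg_W)
    a = a + lr * (V_data.mean(axis=0) - V_fantasy.mean(axis=0))
    b = b + lr * (P_data.mean(axis=0) - P_f.mean(axis=0))
    return a, b, W, V_fantasy
```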

contrastive divergence¹ [Hinton 2002]
- a standard approach for learning RBMs
- k-step contrastive divergence
  - Input: graph $G$ over $v, h$, training samples $S = \{v^{(1)}, \dots, v^{(n)}\}$
  - Output: gradient estimates $\{\Delta w_{ij}\}_{i \in [N], j \in [M]}$, $\{\Delta a_i\}_{i \in [N]}$, $\{\Delta b_j\}_{j \in [M]}$
  1. initialize $\Delta w_{ij}, \Delta a_i, \Delta b_j = 0$
  2. repeat
  3.   for all $v^{(\ell)} \in S$
  4.     $v(0) \leftarrow v^{(\ell)}$
  5.     for $t = 0, \dots, k-1$ do
  6.       for $j = 1, \dots, M$ do sample $h(t)_j \sim \mu(h_j | v(t))$
  7.       for $i = 1, \dots, N$ do sample $v(t+1)_i \sim \mu(v_i | h(t))$
  8.     for $i = 1, \dots, N$, $j = 1, \dots, M$ do
  9.       $\Delta w_{ij} \leftarrow \Delta w_{ij} + \mathbb{E}_{\mu(h_j|v(0))}[h_j]\, v(0)_i - \mathbb{E}_{\mu(h_j|v(k))}[h_j]\, v(k)_i$
  10.      $\Delta a_i \leftarrow \Delta a_i + v(0)_i - v(k)_i$
  11.      $\Delta b_j \leftarrow \Delta b_j + \mathbb{E}_{\mu(h_j|v(0))}[h_j] - \mathbb{E}_{\mu(h_j|v(k))}[h_j]$
  12. end for
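A NumPy rendering of the k-step estimate for a single training sample, following the pseudocode above (the conditional expectations $\mathbb{E}_{\mu(h_j|v)}[h_j]$ are just sigmoids; names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_gradient(v0, a, b, W, k=1, rng=None):
    """k-step contrastive divergence estimate (dW, da, db) of the log-likelihood gradient for v0."""
    rng = rng or np.random.default_rng()
    v = v0.astype(float)
    for _ in range(k):
        h = (rng.random(b.shape) < sigmoid(b + W.T @ v)).astype(float)  # h(t) ~ mu(h | v(t))
        v = (rng.random(a.shape) < sigmoid(a + W @ h)).astype(float)    # v(t+1) ~ mu(v | h(t))
    p0, pk = sigmoid(b + W.T @ v0), sigmoid(b + W.T @ v)   # E[h | v(0)] and E[h | v(k)]
    dW = np.outer(v0, p0) - np.outer(v, pk)
    da = v0 - v
    db = p0 - pk
    return dW, da, db
```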

¹ Hinton, 2012 IPAM tutorial

- since the Markov chain is run for finite $k$ (often $k = 1$ in practice), the resulting approximation of the gradient is biased (it depends on the training samples $\{v^{(\ell)}\}$)
- Theorem.
$$\frac{\partial \log \mu(v(0))}{\partial w_{ij}} = \Delta w_{ij} + \mathbb{E}_{\mu(h|v) P(v(k))}\big[ v(k)_i h_j \big] - \mathbb{E}_{\mu(v,h)}[v_i h_j]$$

- the gap vanishes as $k \to \infty$ and $P(v(k)) \to \mu(v)$
- in general, for finite $k$, the bias can lead to contrastive divergence converging to a solution that is not the maximum-likelihood estimate

(figure: contrastive divergence with $k = 1, 2, 5, 10, 20, 100$ and 16 hidden nodes, with training data from the $4 \times 4$ bars-and-stripes example²)

² ["An Introduction to Restricted Boltzmann Machines", Fischer, Igel], ["Training restricted Boltzmann machines", Fischer]

Example: collaborative filtering

Restricted Boltzmann machines for collaborative filtering [Salakhutdinov, Mnih, Hinton ’07]

- suppose there are $M$ movies and $N$ users
- each user provides 5-star ratings for a subset of movies

- model each user as a restricted Boltzmann machine with its own hidden variables
- equivalently, treat each user as an independent sample from the RBM

for a user who rated $m$ movies:
- let $V \in \{0,1\}^{K \times m}$ be the observed ratings, where $v_i^k = 1$ if the user gave rating $k$ to movie $i$
- let $h_j$ be the binary-valued hidden variables for $j \in \{1, \dots, J\}$
- users have different RBMs but share the same weights $W_{ij}^k$

RBM:
$$\mu(V, h) = \frac{1}{Z} \exp\Big\{ \sum_{k=1}^5 (V^k)^T W^k h + \sum_{k=1}^5 (a^k)^T V^k + b^T h \Big\}$$

learning: compute the gradient for each user and average over all users
$$W_{ij}^k(t+1) = W_{ij}^k(t) + \epsilon_t \big( \mathbb{E}_{\mu(h|V_{\text{sample}})}[V_i^k h_j] - \mathbb{E}_{\mu(V,h)}[V_i^k h_j] \big)$$

prediction
- predicting a single movie $v_{m+1}$ is easy:
$$\mu(v_{m+1}^r = 1 | V) \propto \sum_{h \in \{0,1\}^J} \mu(v_{m+1}^r = 1, v_{m+1}^{\bar r} = 0, V, h)$$
$$\propto \sum_{h \in \{0,1\}^J} \exp\Big\{ \sum_{k=1}^5 (V^k)^T W^k h + a_{m+1}^r v_{m+1}^r + \sum_j W_{m+1,j}^r h_j + b^T h \Big\}$$
$$\propto C_r \prod_{j=1}^J \sum_{h_j \in \{0,1\}} \exp\Big\{ \sum_{i,k} V_i^k W_{ij}^k h_j + W_{m+1,j}^r h_j + b_j h_j \Big\}$$

- however, predicting $L$ movies requires $5^L$ evaluations
- instead, approximate it by computing $\mu(h|V)$ and using it to evaluate $\mu(v_\ell|h)$ (a sketch follows below)
- on the $17{,}770 \times 480{,}189$ Netflix dataset, $J = 100$ works well
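A sketch of that approximation for a single query movie $q$: infer $\hat h = \mu(h|V)$ once from the user's observed ratings, then score each rating $r$ via the conditional distribution over the 5 rating units of movie $q$ (the array layout and names here are my assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_rating_distribution(V, W, a, b, q, rated):
    """V: (5, m) one-hot ratings over the user's m rated movies;
    W: (5, n_movies, J) shared weights; a: (5, n_movies) visible biases; b: (J,) hidden biases;
    q: index of the query movie; rated: indices of the m rated movies.
    Returns a length-5 distribution over the rating of movie q."""
    h_hat = sigmoid(b + np.einsum('km,kmj->j', V, W[:, rated, :]))  # mu(h_j = 1 | V)
    scores = np.exp(a[:, q] + W[:, q, :] @ h_hat)                   # unnormalized score per rating r
    return scores / scores.sum()
```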

Deep Boltzmann machine

(figure: a deep Boltzmann machine with visible layer $v$ and hidden layers $h$ and $s$)

$$\mu(v, h, s) = \frac{1}{Z} \exp\big\{ a^T v + b^T h + c^T s + v^T W^1 h + h^T W^2 s \big\}$$

capable of learning higher-level and more complex representations

Deep Boltzmann machine [Salakhutdinov, Hinton '09]

MNIST dataset with a 60,000-image training set
- samples generated by running the Gibbs sampler for 100,000 steps

NORB dataset with a training set of 24,300 stereo image pairs
- 25 toy objects in 5 classes (car, truck, plane, animal, human)

(stochastic) gradient ascent (drop $a$, $b$, $c$ for simplicity)

$$\mathcal{L}(W^1, W^2) = \log \mu(v^{(\ell)}) = \log \sum_{h,s} \exp\big\{ (v^{(\ell)})^T W^1 h + h^T W^2 s \big\} - \log Z$$

gradient:
$$\frac{\partial \log \mu(v^{(\ell)})}{\partial W^1_{ij}} = \frac{\sum_{h,s} v_i^{(\ell)} h_j \, e^{(v^{(\ell)})^T W^1 h + h^T W^2 s}}{\sum_{h,s} e^{(v^{(\ell)})^T W^1 h + h^T W^2 s}} - \frac{\sum_{v,h,s} v_i h_j \, e^{v^T W^1 h + h^T W^2 s}}{Z} = \mathbb{E}_{\mu(h|v^{(\ell)})}[v_i^{(\ell)} h_j] - \mathbb{E}_{\mu(v,h)}[v_i h_j]$$
$$\frac{\partial \log \mu(v^{(\ell)})}{\partial W^2_{ij}} = \mathbb{E}_{\mu(h,s|v^{(\ell)})}[h_i s_j] - \mathbb{E}_{\mu(h,s)}[h_i s_j]$$

problem: now even the data-dependent expectation is difficult to compute

Variational method

use a naive mean-field approximation to get a lower bound on the objective function:
$$\mathcal{L}(W^1, W^2) = \log \mu(v^{(\ell)}) \geq \log \mu(v^{(\ell)}) - D_{\mathrm{KL}}\big( b(h,s) \,\big\|\, \mu(h,s|v^{(\ell)}) \big) = \sum_{h,s} b(h,s) \log \mu(h,s,v^{(\ell)}) + H(b) \triangleq G(b, W^1, W^2; v^{(\ell)})$$

instead of maximizing $\mathcal{L}(W^1, W^2)$, maximize $G(b, W^1, W^2; v^{(\ell)})$ to get approximately optimal $W^1$ and $W^2$; constrain $b(h,s) = \prod_i p_i(h_i) \prod_j q_j(s_j)$ for simplicity, and let $p_i = p_i(h_i = 1)$ and $q_j = q_j(s_j = 1)$:
$$G = \sum_{k,i} W^1_{ki} v_k^{(\ell)} p_i + \sum_{i,j} W^2_{ij} p_i q_j - \log Z + \sum_i H(p_i) + \sum_j H(q_j)$$

- fix $(W^1, W^2)$ and find the maximizers $\{p_i^*\}, \{q_j^*\}$
- fix $\{p_i^*\}, \{q_j^*\}$ and update $(W^1, W^2)$ according to gradient ascent
$$W^2_{ij}(t+1) = W^2_{ij}(t) + \epsilon_t \big( p_i^* q_j^* - \mathbb{E}_{\mu(h,s)}[h_i s_j] \big)$$

how do we find $\{p_i^*\}$ and $\{q_j^*\}$? set the gradient of $G$ to zero:
$$\frac{\partial G}{\partial p_i} = \sum_k W^1_{ki} v_k^{(\ell)} + \sum_j W^2_{ij} q_j - \log p_i + \log(1 - p_i) = 0$$
$$\frac{\partial G}{\partial q_j} = \sum_i W^2_{ij} p_i - \log q_j + \log(1 - q_j) = 0$$

$$p_i^* = \frac{1}{1 + e^{-\sum_k W^1_{ki} v_k^{(\ell)} - \sum_j W^2_{ij} q_j^*}}, \qquad q_j^* = \frac{1}{1 + e^{-\sum_i W^2_{ij} p_i^*}}$$
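These coupled fixed-point equations are typically solved by simply iterating them; a minimal sketch for one sample $v^{(\ell)}$ (the iteration count and uniform initialization are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W1, W2, n_iters=50):
    """Naive mean field for a two-hidden-layer DBM: W1 is (N, M), W2 is (M, S).
    Returns p with p[i] = b(h_i = 1) and q with q[j] = b(s_j = 1)."""
    M, S = W1.shape[1], W2.shape[1]
    p, q = 0.5 * np.ones(M), 0.5 * np.ones(S)
    for _ in range(n_iters):
        p = sigmoid(W1.T @ v + W2 @ q)   # p_i* = sigma(sum_k W1_ki v_k + sum_j W2_ij q_j)
        q = sigmoid(W2.T @ p)            # q_j* = sigma(sum_i W2_ij p_i)
    return p, q
```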
