10707 Deep Learning
Russ Salakhutdinov
Machine Learning Department
[email protected]
http://www.cs.cmu.edu/~rsalakhu/10707/

Autoencoders

Neural Networks Online Course
• Disclaimer: Much of the material and slides for this lecture were borrowed from Hugo Larochelle's class on Neural Networks (Hugo Larochelle, Département d'informatique, Université de Sherbrooke, October 16, 2012).
• Hugo's class covers many other topics: convolutional networks, neural language models, Boltzmann machines, autoencoders, sparse coding, etc.
• We will use his material for some of the other lectures.

Autoencoders
• Feed-forward neural network trained to reproduce its input at the output layer.
• Encoder: $h(x) = g(a(x)) = \mathrm{sigm}(b + Wx)$
• Decoder: $\hat{x} = o(\hat{a}(x)) = \mathrm{sigm}(c + W^* h(x))$
• Reconstruction loss $l(f(x))$: $\sum_k (\hat{x}_k - x_k)^2$ for real-valued inputs, or $-\sum_k \big(x_k \log \hat{x}_k + (1 - x_k)\log(1 - \hat{x}_k)\big)$ for binary inputs.

Autoencoders
• Architecture: input image → encoder (feed-forward, bottom-up path) → feature representation → decoder (feed-back, generative, top-down path) → reconstruction.
• Details of what goes inside the encoder and decoder matter!
• Need constraints to avoid learning an identity mapping.

Autoencoders
• Encoder: binary features $z = \sigma(Wx)$, computed with encoder filters $W$ and a sigmoid function.
• Decoder: linear reconstruction $W^\top z$, with decoder filters $D = W^\top$.

Another Autoencoder Model
• Encoder: binary features $z = \sigma(Wx)$, with encoder filters $W$.
• Decoder: sigmoid reconstruction $\sigma(W^\top z)$ of the binary input $x$, with decoder filters $W^\top$.
• Need additional constraints to avoid learning an identity mapping.
• Relates to Restricted Boltzmann Machines.
• Encoder and decoder filters can be different.
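The encoder/decoder pair above can be written in a few lines of NumPy. This is a minimal sketch with tied weights ($W^* = W^\top$); the sizes, random seed, and helper names `encode`/`decode` are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigm(z):
    # Elementwise logistic sigmoid.
    return 1.0 / (1.0 + np.exp(-z))

D, H = 6, 3                               # input and hidden sizes (illustrative)
W = rng.normal(scale=0.1, size=(H, D))    # encoder filters
b = np.zeros(H)                           # encoder bias
c = np.zeros(D)                           # decoder bias

def encode(x):
    # h(x) = sigm(b + W x)
    return sigm(b + W @ x)

def decode(h):
    # x_hat = sigm(c + W^T h)  -- tied weights
    return sigm(c + W.T @ h)

x = rng.integers(0, 2, size=D).astype(float)  # a binary input
h = encode(x)
x_hat = decode(h)
print(h.shape, x_hat.shape)  # (3,) (6,)
```

With random small weights the reconstruction is near 0.5 everywhere; training (next slides) shapes $W$, $b$, $c$ so that $\hat{x}$ matches $x$.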
Loss Function
• Loss function for binary inputs:
  $l(f(x)) = -\sum_k \big(x_k \log \hat{x}_k + (1 - x_k)\log(1 - \hat{x}_k)\big)$
  Ø Cross-entropy error function (reconstruction loss).
• Loss function for real-valued inputs:
  $l(f(x)) = \tfrac{1}{2}\sum_k (\hat{x}_k - x_k)^2$
  Ø Sum of squared differences (reconstruction loss).
  Ø We use a linear activation function at the output.

• Forward propagation:
  $a(x^{(t)}) = b + W x^{(t)}$
  $h(x^{(t)}) = \mathrm{sigm}(a(x^{(t)}))$
  $\hat{a}(x^{(t)}) = c + W^\top h(x^{(t)})$
  $\hat{x}^{(t)} = \mathrm{sigm}(\hat{a}(x^{(t)}))$
• Backpropagation:
  $\nabla_{\hat{a}(x^{(t)})} l(f(x^{(t)})) = \hat{x}^{(t)} - x^{(t)}$
  $\nabla_{c} l(f(x^{(t)})) = \nabla_{\hat{a}(x^{(t)})} l(f(x^{(t)}))$
  $\nabla_{h(x^{(t)})} l(f(x^{(t)})) = W \, \nabla_{\hat{a}(x^{(t)})} l(f(x^{(t)}))$
  $\nabla_{a(x^{(t)})} l(f(x^{(t)})) = \nabla_{h(x^{(t)})} l(f(x^{(t)})) \odot [\ldots,\, h(x^{(t)})_j (1 - h(x^{(t)})_j),\, \ldots]$
  $\nabla_{b} l(f(x^{(t)})) = \nabla_{a(x^{(t)})} l(f(x^{(t)}))$
  $\nabla_{W} l(f(x^{(t)})) = \nabla_{a(x^{(t)})} l(f(x^{(t)}))\, (x^{(t)})^\top + h(x^{(t)})\, \big(\nabla_{\hat{a}(x^{(t)})} l(f(x^{(t)}))\big)^\top$
  (tied weights: $W^* = W^\top$)
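The forward and backward passes above translate directly into NumPy. This is a sketch under the tied-weights convention ($W$ of shape $H \times D$); the sizes, seed, and the finite-difference check at the end are illustrative assumptions, not part of the slides.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
D, H = 5, 4                                   # illustrative sizes
W = rng.normal(scale=0.1, size=(H, D))
b, c = np.zeros(H), np.zeros(D)
x = rng.integers(0, 2, size=D).astype(float)  # binary input

# Forward propagation (slide notation: a, h, a_hat, x_hat).
a = b + W @ x
h = sigm(a)
a_hat = c + W.T @ h
x_hat = sigm(a_hat)

# Backpropagation: with cross-entropy loss and sigmoid outputs,
# the gradient at the decoder pre-activation is simply x_hat - x.
grad_a_hat = x_hat - x
grad_c = grad_a_hat
grad_h = W @ grad_a_hat
grad_a = grad_h * h * (1.0 - h)   # elementwise sigmoid derivative h_j (1 - h_j)
grad_b = grad_a
# Tied weights: encoder contribution + decoder contribution.
grad_W = np.outer(grad_a, x) + np.outer(h, grad_a_hat)

# Central finite-difference check of one entry of grad_W.
def loss(Wp):
    hp = sigm(b + Wp @ x)
    xp = sigm(c + Wp.T @ hp)
    return -np.sum(x * np.log(xp) + (1 - x) * np.log(1 - xp))

eps, (i, j) = 1e-6, (2, 3)
Wp, Wm = W.copy(), W.copy()
Wp[i, j] += eps
Wm[i, j] -= eps
numeric = (loss(Wp) - loss(Wm)) / (2 * eps)
assert abs(numeric - grad_W[i, j]) < 1e-5
```

The assertion passes because the analytic gradient matches the numeric one up to discretization error; this kind of gradient check is a standard way to debug hand-written backprop.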
Loss Function
• For both cases, the gradient at the decoder pre-activation has a very simple form:
  $\nabla_{\hat{a}(x^{(t)})} l(f(x^{(t)})) = \hat{x}^{(t)} - x^{(t)}$
• Parameter gradients are obtained by backpropagating the gradient like in a regular network.
  Ø Important: when using tied weights ($W^* = W^\top$), $\nabla_W l(f(x^{(t)}))$ is the sum of two gradients.
  Ø This is because $W$ is present in the encoder and in the decoder.

Autoencoder
• Adapting an autoencoder to a new type of input:
  Ø choose a joint distribution $p(x \mid \mu)$ over the inputs, where $\mu$ …
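The tied-weights point above ($\nabla_W$ is the sum of an encoder contribution and a decoder contribution, because $W$ appears in both paths) can be checked numerically by untying the two copies of $W$. This sketch uses central finite differences; all sizes and names are illustrative assumptions.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
D, H = 4, 3
W = rng.normal(scale=0.1, size=(H, D))
b, c = np.zeros(H), np.zeros(D)
x = rng.integers(0, 2, size=D).astype(float)

def loss(W_enc, W_dec):
    # Cross-entropy reconstruction loss; the decoder uses W_dec^T.
    h = sigm(b + W_enc @ x)
    x_hat = sigm(c + W_dec.T @ h)
    return -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

def num_grad(f, W0, eps=1e-6):
    # Entrywise central finite-difference gradient of f at W0.
    g = np.zeros_like(W0)
    for i in range(W0.shape[0]):
        for j in range(W0.shape[1]):
            Wp, Wm = W0.copy(), W0.copy()
            Wp[i, j] += eps
            Wm[i, j] -= eps
            g[i, j] = (f(Wp) - f(Wm)) / (2 * eps)
    return g

g_enc = num_grad(lambda We: loss(We, W), W)    # perturb encoder copy only
g_dec = num_grad(lambda Wd: loss(W, Wd), W)    # perturb decoder copy only
g_tied = num_grad(lambda Wt: loss(Wt, Wt), W)  # tied: W in both paths
assert np.allclose(g_tied, g_enc + g_dec, atol=1e-6)
```

The assertion is just the multivariate chain rule: differentiating through both occurrences of $W$ at once gives the sum of the two partial gradients.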