10707 Deep Learning Russ Salakhutdinov

Machine Learning Department
http://www.cs.cmu.edu/~rsalakhu/10707/

Autoencoders

Neural Networks Online Course
• Disclaimer: Much of the material and slides for this lecture were borrowed from Hugo Larochelle's class on Neural Networks.
• Hugo's class covers many other topics: convolutional networks, neural language models, Boltzmann machines, autoencoders, sparse coding, etc.
• We will use his material for some of the other lectures.

Autoencoders
(Math sheet: Hugo Larochelle, Département d'informatique, Université de Sherbrooke, October 16, 2012. Abstract: Math for my slides "Autoencoders".)
• Feed-forward neural network trained to reproduce its input at the output layer
• Encoder: $h(x) = g(a(x)) = \mathrm{sigm}(b + Wx)$
• Decoder (for binary units): $\hat{x} = o(\hat{a}(x)) = \mathrm{sigm}(c + W^{*} h(x))$
• The network output is the reconstruction: $f(x) \equiv \hat{x}$

Autoencoders
[Figure: Input Image → Encoder (feed-forward, bottom-up path) → Feature Representation → Decoder (feed-back, generative, top-down path)]
• Details of what goes inside the encoder and decoder matter!
• Need constraints to avoid learning an identity.

Autoencoders
[Figure: Input Image $x$ → Encoder filters $W$, sigmoid function, $z = \sigma(Wx)$ → Binary Features $z$ → Decoder filters $D$, linear function, $W^{\top} z$]

Another Autoencoder Model
[Figure: Binary Input $x$ → Encoder filters $W$, sigmoid function, $z = \sigma(Wx)$ → Binary Features $z$ → Decoder filters $W^{\top}$, sigmoid function, $\sigma(W^{\top} z)$]
• Need additional constraints to avoid learning an identity.
• Relates to Restricted Boltzmann Machines.
• Encoder and Decoder filters can be different.
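The encoder and decoder equations can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the layer sizes, the random initialization, and the function names `sigm`, `encode`, and `decode` are hypothetical and do not come from the slides, and the decoder weights are tied to the encoder ($W^{*} = W^{\top}$) as one possible choice.

```python
import numpy as np

def sigm(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W, b):
    """Encoder: h(x) = sigm(b + W x)."""
    return sigm(b + W @ x)

def decode(h, W_star, c):
    """Decoder for binary units: x_hat = sigm(c + W* h(x))."""
    return sigm(c + W_star @ h)

# Hypothetical sizes and parameters, chosen only for illustration.
rng = np.random.default_rng(0)
n_in, n_hid = 6, 3
W = rng.normal(scale=0.1, size=(n_hid, n_in))  # encoder weights
b = np.zeros(n_hid)                            # encoder bias
c = np.zeros(n_in)                             # decoder bias
W_star = W.T                                   # assumption: tied weights

x = rng.integers(0, 2, size=n_in).astype(float)  # a binary input vector
h = encode(x, W, b)            # feature representation
x_hat = decode(h, W_star, c)   # reconstruction f(x)

print("hidden units:", h.shape, "reconstruction:", x_hat.shape)
```

Because the output activation is a sigmoid, every entry of `x_hat` lies strictly between 0 and 1 and can be read as the probability that the corresponding binary input is on.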
Loss Function
• Loss function for binary inputs:
  $l(f(x)) = -\sum_k \left( x_k \log(\hat{x}_k) + (1 - x_k) \log(1 - \hat{x}_k) \right)$
  - Cross-entropy error function (reconstruction loss), with $f(x) \equiv \hat{x}$
• Loss function for real-valued inputs:
  $l(f(x)) = \frac{1}{2} \sum_k (\hat{x}_k - x_k)^2$
  - sum of squared differences (reconstruction loss)
  - we use a linear activation function at the output

Loss Function
• For both cases, the gradient $\nabla_{\hat{a}(x^{(t)})} l(f(x^{(t)}))$ has a very simple form:
  $\nabla_{\hat{a}(x^{(t)})} l(f(x^{(t)})) = \hat{x}^{(t)} - x^{(t)}$
• Parameter gradients are obtained by backpropagating the gradient $\nabla_{\hat{a}(x^{(t)})} l(f(x^{(t)}))$ like in a regular network
  - important: when using tied weights ($W^{*} = W^{\top}$), $\nabla_W l(f(x^{(t)}))$ is the sum of two gradients
  - this is because $W$ is present in the encoder and in the decoder

Forward pass (tied weights, $W^{*} = W^{\top}$):
  $a(x^{(t)}) \Leftarrow b + W x^{(t)}$
  $h(x^{(t)}) \Leftarrow \mathrm{sigm}(a(x^{(t)}))$
  $\hat{a}(x^{(t)}) \Leftarrow c + W^{\top} h(x^{(t)})$
  $\hat{x}^{(t)} \Leftarrow \mathrm{sigm}(\hat{a}(x^{(t)}))$

Backward pass:
  $\nabla_{\hat{a}(x^{(t)})} l(f(x^{(t)})) \Leftarrow \hat{x}^{(t)} - x^{(t)}$
  $\nabla_{c}\, l(f(x^{(t)})) \Leftarrow \nabla_{\hat{a}(x^{(t)})} l(f(x^{(t)}))$
  $\nabla_{h(x^{(t)})} l(f(x^{(t)})) \Leftarrow W \left( \nabla_{\hat{a}(x^{(t)})} l(f(x^{(t)})) \right)$
  $\nabla_{a(x^{(t)})} l(f(x^{(t)})) \Leftarrow \left( \nabla_{h(x^{(t)})} l(f(x^{(t)})) \right) \odot [\ldots, h(x^{(t)})_j (1 - h(x^{(t)})_j), \ldots]$
  $\nabla_{b}\, l(f(x^{(t)})) \Leftarrow \nabla_{a(x^{(t)})} l(f(x^{(t)}))$
  $\nabla_{W} l(f(x^{(t)})) \Leftarrow \left( \nabla_{a(x^{(t)})} l(f(x^{(t)})) \right) {x^{(t)}}^{\top} + h(x^{(t)}) \left( \nabla_{\hat{a}(x^{(t)})} l(f(x^{(t)})) \right)^{\top}$

Autoencoder
• Adapting an autoencoder to a new type of input
  - choose a joint distribution $p(x \mid \mu)$ over the inputs, where $\mu$ …
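The backpropagation steps for the cross-entropy loss with tied weights can be sketched in NumPy and checked against finite differences. This is a minimal illustration, not the course's reference code: the function names (`forward`, `gradients`, `numerical_grad_W`), the layer sizes, and the random parameters are all assumptions made for the example.

```python
import numpy as np

def sigm(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(x, x_hat):
    """l(f(x)) = -sum_k [x_k log(x_hat_k) + (1 - x_k) log(1 - x_hat_k)]."""
    return -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))

def forward(x, W, b, c):
    """Forward pass with tied weights (decoder uses W transposed)."""
    h = sigm(b + W @ x)
    x_hat = sigm(c + W.T @ h)
    return h, x_hat

def gradients(x, W, b, c):
    """Backward pass following the update rules listed above."""
    h, x_hat = forward(x, W, b, c)
    grad_a_hat = x_hat - x                 # gradient at the output pre-activation
    grad_c = grad_a_hat
    grad_h = W @ grad_a_hat
    grad_a = grad_h * h * (1.0 - h)        # element-wise sigmoid derivative
    grad_b = grad_a
    # Tied weights: encoder contribution plus decoder contribution.
    grad_W = np.outer(grad_a, x) + np.outer(h, grad_a_hat)
    return grad_W, grad_b, grad_c

def numerical_grad_W(x, W, b, c, eps=1e-6):
    """Central finite differences on each entry of W (for checking only)."""
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += eps
            Wm[i, j] -= eps
            lp = cross_entropy(x, forward(x, Wp, b, c)[1])
            lm = cross_entropy(x, forward(x, Wm, b, c)[1])
            G[i, j] = (lp - lm) / (2.0 * eps)
    return G

# Hypothetical sizes and parameters, chosen only for illustration.
rng = np.random.default_rng(0)
n_in, n_hid = 5, 3
W = rng.normal(scale=0.1, size=(n_hid, n_in))
b = rng.normal(scale=0.1, size=n_hid)
c = rng.normal(scale=0.1, size=n_in)
x = rng.integers(0, 2, size=n_in).astype(float)

grad_W, grad_b, grad_c = gradients(x, W, b, c)
num_W = numerical_grad_W(x, W, b, c)
print("max abs difference:", np.max(np.abs(grad_W - num_W)))
```

Dropping either of the two outer products in `grad_W` makes the finite-difference check fail, which is a quick way to see why the tied-weight gradient has to be the sum of the encoder and decoder terms.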
