Deep Learning II: Unsupervised Learning
Russ Salakhutdinov
Machine Learning Department, Carnegie Mellon University
Canadian Institute for Advanced Research
Talk Roadmap
Part 1: Supervised Learning: Deep Networks
Part 2: Unsupervised Learning: Learning Deep Generative Models
Part 3: Open Research Questions
Unsupervised Learning
Non-probabilistic Models
Ø Sparse Coding
Ø Autoencoders
Ø Others (e.g. k-means)
Probabilistic (Generative) Models
Explicit Density p(x):
Tractable Models
Ø Fully observed Belief Nets
Ø NADE
Ø PixelRNN
Non-Tractable Models
Ø Boltzmann Machines
Ø Variational Autoencoders
Ø Helmholtz Machines
Ø Many others…
Implicit Density:
Ø Generative Adversarial Networks
Ø Moment Matching Networks
Talk Roadmap
• Basic Building Blocks: Ø Sparse Coding Ø Autoencoders
• Deep Generative Models Ø Restricted Boltzmann Machines Ø Deep Belief Networks and Deep Boltzmann Machines Ø Helmholtz Machines / Variational Autoencoders
• Generative Adversarial Networks
• Model Evaluation
Sparse Coding
• Sparse coding (Olshausen & Field, 1996). Originally developed to explain early visual processing in the brain (edge detection).
• Objective: Given a set of input data vectors, learn a dictionary of bases such that:
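The objective on the original slide is omitted in this transcript; the standard sparse coding cost it refers to has the form below (the exact normalization or constraint on the bases, e.g. $\|\phi_k\|_2 \le 1$, may differ on the slide):

$$\min_{\{\phi_k\},\,\{a^{(n)}\}} \;\sum_{n=1}^{N} \Big\| \mathbf{x}^{(n)} - \sum_{k=1}^{K} a^{(n)}_k \phi_k \Big\|_2^2 \;+\; \lambda \sum_{n=1}^{N} \sum_{k=1}^{K} \big| a^{(n)}_k \big|$$

where the first term is the reconstruction error and the second is the sparsity penalty.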
Sparse: mostly zeros
• Each data vector is represented as a sparse linear combination of bases.
Sparse Coding
Natural Images → Learned bases: “Edges”
New example
x = 0.8 * [basis] + 0.3 * [basis] + 0.5 * [basis]
[0, 0, …, 0.8, …, 0.3, …, 0.5, …] = coefficients (feature representation)
Slide Credit: Honglak Lee
Sparse Coding: Training
• Input image patches:
• Learn dictionary of bases:
Reconstruction error + Sparsity penalty
• Alternating Optimization:
1. Fix the dictionary of bases and solve for the activations a (a standard Lasso problem).
2. Fix the activations a and optimize the dictionary of bases (a convex QP problem).
Sparse Coding: Testing Time
• Input: a new image patch x*, and the K learned bases.
• Output: sparse representation a of the image patch x* (see the sketch below).
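A minimal numpy sketch of this alternating optimization, assuming patches are stacked as columns of a matrix. It uses ISTA soft-thresholding for the Lasso step and an unconstrained least-squares fit plus renormalization in place of the constrained QP, so it is illustrative rather than the exact procedure from the slides:

```python
import numpy as np

def sparse_codes(D, X, lam=0.1, n_steps=100):
    """Lasso step: codes A minimizing ||X - D A||^2 + lam * |A|_1, via ISTA."""
    L = np.linalg.norm(D, 2) ** 2                 # Lipschitz constant of the smooth term
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_steps):
        G = A - D.T @ (D @ A - X) / L             # gradient step on the reconstruction error
        A = np.sign(G) * np.maximum(np.abs(G) - lam / L, 0.0)   # soft-threshold -> sparsity
    return A

def learn_dictionary(X, n_bases=64, lam=0.1, n_iters=20, seed=0):
    """Alternate: (1) fix bases, solve for activations; (2) fix activations, refit bases."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], n_bases))
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    for _ in range(n_iters):
        A = sparse_codes(D, X, lam)               # step 1: sparse coding (Lasso)
        D = X @ np.linalg.pinv(A)                 # step 2: least-squares dictionary update
        D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12   # keep bases bounded
    return D, A

# Test time: encode a new patch x_new with the frozen, learned bases.
# a_new = sparse_codes(D, x_new.reshape(-1, 1), lam=0.1)
```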
x* = 0.8 * [basis] + 0.3 * [basis] + 0.5 * [basis]
[0, 0, …, 0.8, …, 0.3, …, 0.5, …] = coefficients (feature representation)
Image Classification
• Evaluated on the Caltech101 object category dataset.
Classification Algorithm (SVM)
Learned Input Image bases Features (coefficients) 9K images, 101 classes
Algorithm / Accuracy:
Baseline (Fei-Fei et al., 2004): 16%
PCA: 37%
Sparse Coding: 47%
Slide Credit: Honglak Lee (Lee, Battle, Raina, Ng, NIPS 2007)
Interpreting Sparse Coding
a Sparse features a
Implicit, nonlinear encoding: a = f(x)    Explicit, linear decoding: x’ = g(a)
• Sparse, over-complete representation a.
• Encoding a = f(x) is an implicit and nonlinear function of x.
• Reconstruction (or decoding) x’ = g(a) is linear and explicit.
Autoencoder
Feature Representation
Feed-back, generative, top-down path: Decoder
Feed-forward, bottom-up path: Encoder
Input Image
• Details of what goes inside the encoder and decoder matter!
• Need constraints to avoid learning an identity mapping.
Autoencoder
Binary Features z
Decoder: filters D, linear function, Dz
Encoder: filters W, sigmoid function, z = σ(Wx)
Input Image x Autoencoder
Binary Features z
Encoder: filters W, sigmoid function, z = σ(Wx)
Decoder: filters D = Wᵀ, sigmoid function, σ(Wᵀz)
Binary Input x
• Need additional constraints to avoid learning an identity mapping.
• Relates to Restricted Boltzmann Machines (later).
Autoencoders
• Feed-forward neural network trained to reproduce its input at the output layer.
• Encoder: $h(x) = g(a(x)) = \mathrm{sigm}(b + Wx)$
• Decoder: $\hat{x} = o(\hat{a}(x)) = \mathrm{sigm}(c + W^{*} h(x))$
Loss Function
• Loss function for binary inputs: cross-entropy error (reconstruction loss)
  $l(f(x)) = -\sum_k \big( x_k \log(\hat{x}_k) + (1 - x_k)\log(1 - \hat{x}_k) \big)$
• Loss function for real-valued inputs: sum of squared differences (reconstruction loss), using a linear activation function at the output
  $l(f(x)) = \tfrac{1}{2} \sum_k (\hat{x}_k - x_k)^2$
• Forward pass: $a(x^{(t)}) = b + Wx^{(t)}$, $h(x^{(t)}) = \mathrm{sigm}(a(x^{(t)}))$, $\hat{a}(x^{(t)}) = c + W^{\top} h(x^{(t)})$, $\hat{x}^{(t)} = \mathrm{sigm}(\hat{a}(x^{(t)}))$
• Backward pass (gradients of the loss):
  $\nabla_{\hat{a}(x^{(t)})} l = \hat{x}^{(t)} - x^{(t)}$, $\quad \nabla_{c}\, l = \nabla_{\hat{a}(x^{(t)})} l$
  $\nabla_{h(x^{(t)})} l = W \, \nabla_{\hat{a}(x^{(t)})} l$
  $\nabla_{a(x^{(t)})} l = \nabla_{h(x^{(t)})} l \odot \big[\ldots, h(x^{(t)})_j (1 - h(x^{(t)})_j), \ldots\big]$, $\quad \nabla_{b}\, l = \nabla_{a(x^{(t)})} l$
  $\nabla_{W}\, l = \big(\nabla_{a(x^{(t)})} l\big) \, x^{(t)\top} + h(x^{(t)}) \big(\nabla_{\hat{a}(x^{(t)})} l\big)^{\top}$
• Tied weights: $W^{*} = W^{\top}$
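A minimal numpy sketch of one stochastic gradient step implementing the equations above (sigmoid encoder/decoder, tied weights, cross-entropy loss); variable names and the learning rate are illustrative:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def autoencoder_step(x, W, b, c, lr=0.1):
    """One SGD step on the cross-entropy reconstruction loss, with tied weights W* = W^T."""
    h = sigm(b + W @ x)                    # encoder: h(x) = sigm(b + W x)
    x_hat = sigm(c + W.T @ h)              # decoder: x_hat = sigm(c + W^T h(x))
    grad_a_hat = x_hat - x                 # gradient at the output pre-activation
    grad_a = (W @ grad_a_hat) * h * (1.0 - h)                 # back through the encoder sigmoid
    grad_W = np.outer(grad_a, x) + np.outer(h, grad_a_hat)    # tied weights: sum of both paths
    W -= lr * grad_W
    b -= lr * grad_a
    c -= lr * grad_a_hat
    return -np.sum(x * np.log(x_hat + 1e-12) + (1 - x) * np.log(1 - x_hat + 1e-12))
```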
Linear Features z
• If the hidden and output layers are linear, the autoencoder learns hidden units that are a linear function of the data and minimize the squared reconstruction error (encoder z = Wx, linear decoder reconstruction).
• The K hidden units will span the same space as the first K principal components. The weight vectors, however, may not be orthogonal.
Input Image x
• With nonlinear hidden units, we have a nonlinear generalization of PCA.
Denoising Autoencoder
• Idea: the representation should be robust to the introduction of noise:
Ø random assignment of a subset of inputs to 0, with probability ν
Ø similar to dropout on the input layer
Ø Gaussian additive noise
• The reconstruction $\hat{x}$ is computed from the corrupted input $\tilde{x}$.
• The loss function compares the reconstruction $\hat{x}$ with the noiseless input x.
• Manifold-learning view: training data concentrate near a low-dimensional manifold; corrupted examples lie farther from it, and the model learns to project them back onto the manifold.
(Vincent et al., ICML 2008)
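A sketch of the denoising variant, mirroring the tied-weight gradients from the autoencoder sketch above; the masking-noise choice and corruption level nu are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
sigm = lambda a: 1.0 / (1.0 + np.exp(-a))

def corrupt(x, nu=0.25):
    """Masking noise: set each input to 0 with probability nu (Gaussian noise is another option)."""
    return x * (rng.random(x.shape) >= nu)

def denoising_step(x, W, b, c, lr=0.1, nu=0.25):
    """Encode the corrupted input, but compare the reconstruction with the noiseless input."""
    x_tilde = corrupt(x, nu)
    h = sigm(b + W @ x_tilde)              # hidden code computed from the corrupted input
    x_hat = sigm(c + W.T @ h)              # reconstruction
    grad_a_hat = x_hat - x                 # loss is measured against the clean x
    grad_a = (W @ grad_a_hat) * h * (1.0 - h)
    W -= lr * (np.outer(grad_a, x_tilde) + np.outer(h, grad_a_hat))
    b -= lr * grad_a
    c -= lr * grad_a_hat
```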
expected reconstruction error amountse to maximizing ExExtratractcintigngandandCompCompososingingRobuRobustsFteatFeaturesureswithwithDenDenoisioisingngAutoAutoencoencodersders
TabTable l1.e 1.CoCompmparisoarison nofofstsactackedkeddenoisdenoisingingautoautoencoencodersders(S(SdA-dA-3)3w) ithwithotherothermomodelsdels. . TestTesterrorerrorratrate one onallalconsil consideredderedclassclassificatificationionproblproblemsemsis isrepreportedortedtogettogetheher wir withtah95%a 95%confidenceconfidenceintinervteal.rval.BeBstestperformerperformer is isininbolbd,old,asaswelwlelasl asthosethoseforfowhicr which hconfidenceconfidenceintervintervalsalsovoevrlap.erlap.SdA-SdA-3 3appappearsearstotoacachievhieve peerformperformancance supe superioreriororor eqequivuivalenalent ttottheo thebesbtesothert othermomdeodel onl onallalprobleml problems excepts exceptbg-rbg-ranadn.dF.orForSdASdA-3,-3,wewaelsoalsoindiindicatcate thee thefracfractiotino n of ofdestrodestroyedyed inputinputcocompmonenponents,ts,asaschosechosen bnybproy properpermomodeldelselselection.ection.NotNote thate thatSAASAA-3-3is ieqs equivuivalenalent ttotSdAo SdA-3-3withwith = =0%.0%.
DatasetDataset SVMSVMrbfrbf SVMSVMpolpoly y DBN-1DBN-1 SAASAA-3-3 DBN-3DBN-3 SdA-SdA-3 3( () ) basicbasic 3.033.030.150.15 3.693.690.170.17 3.943.940.170.17 3.463.460.160.16 3.113.110.150.15 2.802.800.140.14(10%)(10%) ±± ±± ±± ±± ±± ±± rotrot 11.1111.110.280.28 15.4215.420.320.32 14.6914.690.310.31 10.3010.300.270.27 10.3010.300.270.27 10.2910.290.270.27(10%)(10%) ±± ±± ±± ±± ±± ±± bg-rbg-ranand d 14.5814.580.310.31 16.6216.620.330.33 9.809.800.260.26 11.2811.280.280.28 6.736.730.220.22 10.3810.380.270.27(40%)(40%) ±± ±± ±± ±± ±± ±± bg-ibg-imgmg 22.6122.610.370.37 24.0124.010.370.37 16.1516.150.320.32 23.0023.000.370.37 16.3116.310.320.32 16.6816.680.330.33(25%)(25%) ±± ±± ±± ±± ±± ±± rot-brot-bg-imgg-img 55.1855.180.440.44 56.4156.410.430.43 52.2152.210.440.44 51.9351.930.440.44 47.3947.390.440.44 44.4944.490.440.44(25%)(25%) ±± ±± ±± ±± ±± ±± rectrect 2.152.150.130.13 2.152.150.130.13 4.714.710.190.19 2.412.410.130.13 2.602.600.140.14 1.991.990.120.12(10%)(10%) ±± ±± ±± ±± ±± ±± rectre-ctimg-img 24.0424.040.370.37 24.0524.050.370.37 23.6923.690.370.37 24.0524.050.370.37 22.5022.500.370.37 21.5921.590.360.36(25%)(25%) ±± ±± ±± ±± ±± ±± convexconvex 19.1319.130.340.34 19.8219.820.350.35 19.9219.920.350.35 18.4118.410.340.34 18.6318.630.340.34 19.0619.060.340.34(10%)(10%) ±Learned± ±± Filters± ± ±± ±± ±±
Non-corrupted 25% corrupted input
(a)(a)NoNodesdestrotyroedyedinputinputs s (b)(b)25%25%destructiondestruction (c)(c50%) 50%destructdestructionion
(Vincent et al., ICML 2008)
(d)(d)NeuronNeuronA A(0%(0%, 10, 10%,%20%,, 20%,50%50%desdestruction)truction) (e)(e)NeuronNeuronB B(0%,(0%,10%,10%,20%20%, 50%, 50%destructdestruction)ion)
FigureFigure3.3.FiFiltersltersobotainedbtainedafteraftertrainingtrainingththe efirsfirst dent denoisingoisingauautotoenencocoder.der. (a-(a-c)c)shoshow wsomsome eofofthethefiltersfiltersobtainedobtainedafteraftertraitrainingninga adenoisidenoisingngautoautoencoencoderderononMMNISTNISTsamsamplesples, w, ithwithincreasiincreasingng desdestructiontructionlevleevlsels . T. heThefiltersfiltersatatthethesamsaempeospitionositioninithen thethreethreeimimagesagesarearerelatedrelatedonlyonlybybthey thefactfactthatthatthetheautautoencooencodersders werewerestartstartededfromfromthethesamesamerandomrandominiinitialitializazationtionpoinpoint. t. (d(d) and) and(e)(e)zozoomominionn onthethefiltfilerstersobtaineobtained ford fortwtowofo ofthetheneurons,neurons,agaiagain forn forincreasiincreasingngdesdestrutructionctionlevleevlse.ls. AsAscancanbebsee en,seen,withwithnonnoiso noise, e,manmany yfiltersfiltersremairemain nsimsimilailrlayrlyuninuninterestingteresting(undi(undististinctivnctive alme almostostuniuniformformgreygreypatpatches).ches). AsAswewiencincreasrease thee thenoisnoise leve level,eldenois, denoisingingtrainitrainingngforcforcesesthethefiltersfilterstotodi diere erentiatentiatemore,more,andandcapturecapturemoremoredistinctivdistinctive e features.features.HigheHigher noiser noiselevlevelseltsendtendtotionducinduce lese less lsocallocalfilters,filters,asasexpexpectected.ed.OneOnecancandisdistinguistinguish hdi dieren erent kit nkidsndsofoffiltfilersters, , fromfromlocallocalblobblobdetdetectors,ectors,totostrokstroke detectors,e detectors,andandsomesomefulfull clharaccharacterterdetectdetectorsorsatatthethehigherhighernoisnoise leevlevels.els. Extracting and Composing RobusExt Ftraeatcturesing andwithCompDenoosisiingng AutoRobuencost Featdersures with Denoising Autoencoders
Learned Filters
Non-corrupted vs. 50% corrupted input
(Vincent et al., ICML 2008)
Predictive Sparse Decomposition
Binary Features z
L1 Sparsity
Encoder: filters W, sigmoid function, z = σ(Wx)
Decoder: filters D, Dz
Real-valued Input x
At training time: Decoder path and Encoder path.
(Kavukcuoglu, Ranzato, Fergus, LeCun, 2009)
Stacked Autoencoders
Class Labels
Decoder Encoder
Features
Sparsity Decoder Encoder
Features
Sparsity Decoder Encoder
Input x Stacked Autoencoders
Class Labels
Decoder Encoder
Features
Sparsity Decoder Encoder
Features Greedy Layer-wise Learning. Sparsity Decoder Encoder
Input x Stacked Autoencoders
Class Labels • Remove decoders and use feed-forward part. Encoder
• Standard, or convolutional, neural network architecture.
Features
Encoder
• Parameters can be fine-tuned using backpropagation.
Features
Encoder
Input x
Deep Autoencoders
• Pretraining: learn a stack of RBMs (input – 2000 – 1000 – 500 – 30), each trained on the hidden activities of the layer below (weights W1, W2, W3, W4; the top 500–30 RBM gives the code layer).
• Unrolling: stack the RBMs into a deep autoencoder with encoder weights W1 … W4, a 30-dimensional code layer, and decoder weights W4ᵀ … W1ᵀ.
• Fine-tuning: untie the encoder and decoder weights and adjust them all with backpropagation of the reconstruction error.
Pretraining → Unrolling → Fine-tuning
Deep Autoencoders
• A 625 (25×25) – 2000 – 1000 – 500 – 30 autoencoder was trained to extract 30-D real-valued codes for Olivetti face patches.
• Top: random samples from the test dataset.
• Middle: reconstructions by the 30-dimensional deep autoencoder.
• Bottom: reconstructions by 30-dimensional PCA.
(Hinton and Salakhutdinov, Science 2006)
Information Retrieval
[2-D LSA space: document classes such as European Community Monetary/Economic, Interbank Markets, Energy Markets, Disasters and Accidents, Leading Economic Indicators, Legal/Judicial, Government Borrowings, Accounts/Earnings]
• The Reuters Corpus Volume II contains 804,414 newswire stories (randomly split into 402,207 training and 402,207 test).
• “Bag-of-words” representation: each article is represented as a vector containing the counts of the 2000 most frequently used words in the training set.
(Hinton and Salakhutdinov, Science 2006)
Semantic Hashing
[Semantic hashing: documents (e.g. European Community Monetary/Economic, Disasters and Accidents, Government Borrowing, Energy Markets, Accounts/Earnings) are mapped by the hashing function into an address space in which semantically similar documents lie at nearby addresses]
• Learn to map documents into semantic 20-D binary codes.
• Retrieve similar documents stored at the nearby addresses with no search at all.
(Salakhutdinov and Hinton, SIGIR 2007)
Searching Large Image Database using Binary Codes
• Map images into binary codes for fast retrieval.
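A sketch of the retrieval side of semantic hashing: the learned binary code is treated as a memory address, and similar documents (or images) are read out of buckets within a small Hamming ball, with no search over the collection. The helper names and the bucket dictionary are illustrative, not from the papers:

```python
from itertools import combinations

def to_address(code_bits):
    """Pack a binary code (e.g. 20-D) into an integer address."""
    addr = 0
    for bit in code_bits:
        addr = (addr << 1) | int(bit)
    return addr

def hamming_ball(address, n_bits, radius=2):
    """All addresses within the given Hamming distance of `address`."""
    out = [address]
    for r in range(1, radius + 1):
        for flips in combinations(range(n_bits), r):
            a = address
            for b in flips:
                a ^= (1 << b)
            out.append(a)
    return out

def retrieve(query_code, buckets, radius=2):
    """buckets: dict mapping address -> list of item ids, built once from the learned codes."""
    docs = []
    for a in hamming_ball(to_address(query_code), n_bits=len(query_code), radius=radius):
        docs.extend(buckets.get(a, []))
    return docs
```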
• Small Codes: Torralba, Fergus, Weiss, CVPR 2008
• Spectral Hashing: Y. Weiss, A. Torralba, R. Fergus, NIPS 2008
• Kulis and Darrell, NIPS 2009; Gong and Lazebnik, CVPR 2011
• Norouzi and Fleet, ICML 2011
Unsupervised Learning
Non-probabilistic Models
Ø Sparse Coding
Ø Autoencoders
Ø Others (e.g. k-means)
Probabilistic (Generative) Models
Explicit Density p(x):
Tractable Models
Ø Fully observed Belief Nets
Ø NADE
Ø PixelRNN
Non-Tractable Models
Ø Boltzmann Machines
Ø Variational Autoencoders
Ø Helmholtz Machines
Ø Many others…
Implicit Density:
Ø Generative Adversarial Networks
Ø Moment Matching Networks
Talk Roadmap
• Basic Building Blocks: Ø Sparse Coding Ø Autoencoders
• Deep Generative Models Ø Restricted Boltzmann Machines Ø Deep Belief Networks and Deep Boltzmann Machines Ø Helmholtz Machines / Variational Autoencoders
• Generative Adversarial Networks
• Model Evaluation
Deep Generative Model
Sanskrit Model P(image)
25,000 characters from 50 alphabets around the world.
• 3,000 hidden variables • 784 observed variables (28 by 28 images) • About 2 million parameters
Bernoulli Markov Random Field
Deep Generative Model
Conditional Simulation
P(image | partial image)   Bernoulli Markov Random Field
Deep Generative Model
Conditional Simulation
Why so difficult? There are $2^{28 \times 28} \approx 10^{236}$ possible binary images!
P(image | partial image)   Bernoulli Markov Random Field
Fully Observed Models
• Maximum likelihood: $\theta^{*} = \arg\max_{\theta} \; \mathbb{E}_{x \sim p_{\text{data}}} \log p_{\text{model}}(x \mid \theta)$
• Explicitly model conditional probabilities (fully-visible belief net):
  $p_{\text{model}}(x) = p_{\text{model}}(x_1) \prod_{i=2}^{n} p_{\text{model}}(x_i \mid x_1, \ldots, x_{i-1})$
• Each conditional can be a complicated neural network.
• A number of successful models, including
Ø NADE, RNADE (Larochelle et al., 2011)
Ø Pixel CNN (van den Oord et al., 2016)
Ø Pixel RNN (van den Oord et al., 2016)
Pixel CNN
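A toy sketch of the fully-observed factorization above: each conditional here is a simple logistic regression on the preceding pixels (a strictly lower-triangular weight matrix), whereas NADE, PixelCNN and PixelRNN use much richer conditionals with the same factorization:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def fvbn_log_prob(x, W, b):
    """log p(x) = sum_i log p(x_i | x_<i) for binary x; W is strictly lower-triangular."""
    logp = 0.0
    for i in range(len(x)):
        p_i = sigm(b[i] + W[i, :i] @ x[:i])       # p(x_i = 1 | x_<i)
        logp += x[i] * np.log(p_i + 1e-12) + (1 - x[i]) * np.log(1 - p_i + 1e-12)
    return logp

def fvbn_sample(W, b, rng=None):
    """Ancestral sampling: draw x_1, then x_2 | x_1, and so on."""
    rng = rng or np.random.default_rng()
    x = np.zeros(len(b))
    for i in range(len(b)):
        x[i] = rng.random() < sigm(b[i] + W[i, :i] @ x[:i])
    return x
```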
Restricted Boltzmann Machines
Feature Detectors
Pair-wise   Unary
hidden variables
Image visible variables
RBM is a Markov Random Field with:
• Stochastic binary visible variables
• Stochastic binary hidden variables
• Bipartite connections.
Markov random fields, Boltzmann machines, log-linear models.
Learning Features
Observed Data: subset of 25,000 characters
Learned W: “edges”, subset of 1000 features
New Image:
= ….
Logistic Function: suitable for modeling binary images
Model Learning
Hidden units
Given a set of i.i.d. training examples, we want to learn the model parameters. Maximize the log-likelihood objective:
Image visible units
Derivative of the log-likelihood:
Difficult to compute: exponentially many configurations.
Model Learning
hidden variables
Derivative of the log-likelihood:
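The derivative shown on the slide (image omitted here) is the standard RBM maximum-likelihood gradient:

$$\frac{\partial \log P_\theta(\mathbf{v})}{\partial W_{ij}} \;=\; \mathbb{E}_{P_{\text{data}}}\big[\, v_i h_j \,\big] \;-\; \mathbb{E}_{P_{\text{model}}}\big[\, v_i h_j \,\big]$$

with $\mathbf{h}$ drawn from $P(\mathbf{h} \mid \mathbf{v})$ in the first term.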
Image   visible variables
The data-dependent term is easy to compute exactly; the model term is difficult to compute (exponentially many configurations). Use MCMC.
Approximate maximum likelihood learning
Approximate Learning
• An approximation to the gradient of the log-likelihood objective:
• Replace the average over all possible input configurations by samples.
• Run an MCMC chain (Gibbs sampling) starting from the observed examples:
  - Initialize v0 = v
  - Sample h0 from P(h | v0)
  - For t = 1:T: sample vt from P(v | ht−1), then sample ht from P(h | vt)
Approximate ML Learning for RBMs
Run Markov chain (alternating Gibbs Sampling):
…
Data T=1 T= infinity
Equilibrium Distribution
Contrastive Divergence
A quick way to learn an RBM:
• Start with a training vector on the visible units.
• Update all the hidden units in parallel.
• Update all the visible units in parallel to get a “reconstruction”.
• Update the hidden units again.
Data → Reconstructed Data
Update model parameters:
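A minimal sketch of the CD-1 update just described for a binary RBM, assuming a single training vector and the usual $\Delta W \propto \langle v h\rangle_{\text{data}} - \langle v h\rangle_{\text{recon}}$ rule; using hidden probabilities rather than samples in the statistics is a common practical choice:

```python
import numpy as np

rng = np.random.default_rng(0)
sigm = lambda a: 1.0 / (1.0 + np.exp(-a))

def cd1_step(v0, W, a, b, lr=0.05):
    """One CD-1 update. W: (visible x hidden), a: visible biases, b: hidden biases."""
    ph0 = sigm(v0 @ W + b)                       # P(h = 1 | v0)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0     # sample hidden states
    pv1 = sigm(h0 @ W.T + a)                     # "reconstruction" P(v = 1 | h0)
    v1 = (rng.random(pv1.shape) < pv1) * 1.0
    ph1 = sigm(v1 @ W + b)                       # P(h = 1 | v1)
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))   # <v h>_data - <v h>_recon
    a += lr * (v0 - v1)
    b += lr * (ph0 - ph1)
```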
Implementation: ~10 lines of Matlab code.
(Hinton, Neural Computation 2002)
RBMs for Real-valued Data
hidden variables Pair-wise Unary
Image visible variables
Gaussian-Bernoulli RBM:
• Stochastic real-valued visible variables
• Stochastic binary hidden variables
• Bipartite connections.
hidden variables Pair-wise Unary
Image visible variables
Learned features (out of 10,000) 4 million unlabelled images RBMs for Real-valued Data
Learned features (out of 10,000) 4 million unlabelled images
New Image = 0.9 * [feature] + 0.8 * [feature] + 0.6 * [feature] + …
RBMs for Word Counts
Pair-wise Unary
$P_\theta(\mathbf{v}, \mathbf{h}) = \frac{1}{\mathcal{Z}(\theta)} \exp\Big( \sum_{i=1}^{D} \sum_{j=1}^{F} \sum_{k=1}^{K} W_{ij}^{k} v_i^{k} h_j + \sum_{i=1}^{D} \sum_{k=1}^{K} v_i^{k} b_i^{k} + \sum_{j=1}^{F} h_j a_j \Big)$
$P_\theta(v_i^{k} = 1 \mid \mathbf{h}) = \dfrac{\exp\big( b_i^{k} + \sum_{j=1}^{F} h_j W_{ij}^{k} \big)}{\sum_{q=1}^{K} \exp\big( b_i^{q} + \sum_{j=1}^{F} h_j W_{ij}^{q} \big)}$
Replicated Softmax Model: an undirected topic model with
• Stochastic 1-of-K visible variables.
• Stochastic binary hidden variables
• Bipartite connections.
(Salakhutdinov & Hinton, NIPS 2010; Srivastava & Salakhutdinov, NIPS 2012)
RBMs for Word Counts
Pair-wise Unary
$P_\theta(\mathbf{v}, \mathbf{h}) = \frac{1}{\mathcal{Z}(\theta)} \exp\Big( \sum_{i=1}^{D} \sum_{j=1}^{F} \sum_{k=1}^{K} W_{ij}^{k} v_i^{k} h_j + \sum_{i=1}^{D} \sum_{k=1}^{K} v_i^{k} b_i^{k} + \sum_{j=1}^{F} h_j a_j \Big)$
$P_\theta(v_i^{k} = 1 \mid \mathbf{h}) = \dfrac{\exp\big( b_i^{k} + \sum_{j=1}^{F} h_j W_{ij}^{k} \big)}{\sum_{q=1}^{K} \exp\big( b_i^{q} + \sum_{j=1}^{F} h_j W_{ij}^{q} \big)}$
Learned features: “topics”
russian: russia, moscow, yeltsin, soviet
clinton: house, president, bill, congress
computer: system, product, software, develop
trade: country, import, world, economy
stock: wall, street, point, dow
Reuters dataset: 804,414 unlabeled newswire stories, Bag-of-Words.
RBMs for Word Counts
One-step reconstruction from the Replicated Softmax model.
Collaborative Filtering
Binary hidden: user preferences h
Multinomial visible: user ratings v (W1)
Netflix dataset: 480,189 users, 17,770 movies, over 100 million ratings
Learned features: “genre”
Ø Fahrenheit 9/11, Bowling for Columbine, The People vs. Larry Flynt, Canadian Bacon, La Dolce Vita
Ø Independence Day, The Day After Tomorrow, Con Air, Men in Black II, Men in Black
Ø Friday the 13th, The Texas Chainsaw Massacre, Children of the Corn, Child's Play, The Return of Michael Myers
Ø Scary Movie, Naked Gun, Hot Shots!, American Pie, Police Academy
(Salakhutdinov, Mnih, Hinton, ICML 2007)
Different Data Modalities
• Binary/Gaussian/Softmax RBMs: all have binary hidden variables but use them to model different kinds of data.
Binary 0 0 0 1 0 Real-valued 1-of-K
• It is easy to infer the states of the hidden variables:
Product of Experts
The joint distribution is given by:
Marginalizing over hidden variables:
Product of Experts
Learned topics: government, clinton, bribery, mafia, stock, …
government: authority, power, empire, federation
clinton: house, president, bill, congress
bribery: corruption, dishonesty, corrupt, fraud
mafia: business, gang, mob, insider
stock: wall, street, point, dow
Topics “government”, “corruption” and “mafia” can combine to give very high probability to the word “Silvio Berlusconi”.
Marginalizing over hidden variables: Product of Experts 50 Replicated 40 Softmax 50−D government clinton bribery mafia stock … authority house 30 corrup on business wall street power president dishonesty LDA 50−D gang point empire bill 20 corrupt mob federa on congress fraud insider dow Precision (%) 10 Topics “government”, ”corrup on” and ”mafia” can combine to give very 0.001 0.006 0.051 0.4 1.6 6.4 25.6 100 Recall high probability to a word “(%) Silvio Silvio Berlusconi Berlusconi”. Talk Roadmap
• Basic Building Blocks (non-probabilistic models): Ø Sparse Coding Ø Autoencoders
• Deep Generative Models Ø Restricted Boltzmann Machines Ø Deep Boltzmann Machines Ø Helmholtz Machines / Variational Autoencoders
• Generative Adversarial Networks
Deep Belief Network
• Probabilistic generative model.
• Contains multiple layers of nonlinear representation.
• Fast, greedy layer-wise pretraining algorithm.
• Inferring the states of the latent variables in the highest layers is easy.
h3 — W3 — h2 — W2 — h1 — W1 — v
(Hinton et al., Neural Computation 2006)
Deep Belief Network
Low-level features: Edges
Built from unlabeled inputs.
Input: Pixels Image (Hinton et al. Neural Computation 2006) Deep Belief Network
Internal representations capture higher-order statistical structure
Higher-level features: combinations of edges
Low-level features: Edges
Built from unlabeled inputs.
Input: Pixels Image (Hinton et al. Neural Computation 2006) Deep Belief Network
RBM
Hidden Layers Sigmoid Belief Network
Visible Layer
Deep Belief Network
The joint probability distribution factorizes:
$P(\mathbf{v}, \mathbf{h}^1, \mathbf{h}^2, \mathbf{h}^3) = P(\mathbf{v} \mid \mathbf{h}^1)\, P(\mathbf{h}^1 \mid \mathbf{h}^2)\, P(\mathbf{h}^2, \mathbf{h}^3)$
RBM
RBM
Sigmoid Belief Network
Deep Belief Network
Approximate Inference / Generative Process
h3, W3
h2
W2
h1 W1
v DBN Layer-wise Training
• Learn an RBM with an input layer v and a hidden layer h.
h W1
v DBN Layer-wise Training
• Learn an RBM with an input layer v and a hidden layer h. • Treat inferred values as the data for training 2nd-layer RBM.
• Learn and freeze the 2nd-layer RBM.
h2, W1⊤, h1, W1
v DBN Layer-wise Training
Unsupervised Feature Learning.
• Learn an RBM with an input layer v and a hidden layer h.
• Treat inferred values as the data for training the 2nd-layer RBM. (h3, W3)
• Learn and freeze the 2nd-layer RBM. (h2, W2)
• Proceed to the next layer. (h1, W1)
v DBN Layer-wise Training
Unsupervised Feature Learning.
• Learn an RBM with an input layer v and a hidden layer h.
• Treat inferred values as the data for training the 2nd-layer RBM. (h3, W3)
• Learn and freeze the 2nd-layer RBM. (h2, W2)
• Proceed to the next layer. (h1, W1)
v
Layer-wise pretraining improves the variational lower bound (a sketch of the stacking loop follows below).
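A sketch of the greedy stacking loop, assuming hypothetical helpers train_rbm (e.g. CD-1 as sketched earlier) and rbm_hidden_probs that returns $P(\mathbf{h} \mid \mathbf{v})$ for a trained layer:

```python
def greedy_pretrain(data, layer_sizes, n_epochs=10):
    """Greedy layer-wise pretraining: fit an RBM, freeze it, and feed its inferred
    hidden probabilities to the next RBM as 'data'."""
    rbms, layer_input = [], data
    for n_hidden in layer_sizes:
        rbm = train_rbm(layer_input, n_hidden, n_epochs)   # hypothetical helper: fit this layer
        layer_input = rbm_hidden_probs(rbm, layer_input)   # hypothetical helper: P(h | v)
        rbms.append(rbm)
    return rbms
```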
Why this Pre-training Works?
• Greedy training improves the variational lower bound!
h, W1, v
• For any approximating distribution:
• Greedy training improves the variational lower bound.
• RBM and 2-layer DBN are equivalent
when $W^2 = W^{1\top}$. (h2, W1⊤)
• The lower bound is tight and the log-likelihood improves by greedy training.
• For any approximating distribution:
Train 2nd-layer RBM
Learning Part-based Representation
Faces
Convolutional DBN
Groups of parts.
h3 W3 h2 Object Parts W2
h1 W1
v Trained on face images.
(Lee, Grosse, Ranganath, Ng, ICML 2009)
Learning Part-based Representation
Faces, Cars, Elephants, Chairs
(Lee, Grosse, Ranganath, Ng, ICML 2009)
Learning Part-based Representation
Groups of parts.
Class-specific object parts
Trained from multiple classes (cars, faces, motorbikes, airplanes).
(Lee, Grosse, Ranganath, Ng, ICML 2009)