
Deep Learning II: Unsupervised Learning
Russ Salakhutdinov
Machine Learning Department, Carnegie Mellon University
Canadian Institute for Advanced Research

Talk Roadmap
Part 1: Supervised Learning: Deep Networks
Part 2: Unsupervised Learning: Learning Deep Generative Models
Part 3: Open Research Questions

Unsupervised Learning
• Non-probabilistic models
  Ø Sparse Coding
  Ø Autoencoders
  Ø Others (e.g. k-means)
• Probabilistic (generative) models
  Ø Tractable models: fully observed Belief Nets, NADE, PixelRNN
  Ø Non-tractable models: Boltzmann Machines, Variational Autoencoders, Helmholtz Machines, many others...
  Ø Generative Adversarial Networks, Moment Matching Networks
• Tractable and non-tractable models define an explicit density p(x); generative adversarial and moment matching networks define an implicit density.

Talk Roadmap
• Basic Building Blocks:
  Ø Sparse Coding
  Ø Autoencoders
• Deep Generative Models
  Ø Restricted Boltzmann Machines
  Ø Deep Belief Networks and Deep Boltzmann Machines
  Ø Helmholtz Machines / Variational Autoencoders
• Generative Adversarial Networks
• Model Evaluation

Sparse Coding
• Sparse coding (Olshausen & Field, 1996). Originally developed to explain early visual processing in the brain (edge detection).
• Objective: given a set of input data vectors x_1, ..., x_N, learn a dictionary of bases φ_1, ..., φ_K such that each data vector is represented as a sparse linear combination of the bases:
    x_n ≈ Σ_k a_nk φ_k,  with the coefficients a_nk sparse (mostly zeros).

Sparse Coding: Example
• Natural images: the learned bases look like "edges".
• A new example is written as a sparse combination of a few bases, e.g. x = 0.8 * φ_i + 0.3 * φ_j + 0.5 * φ_l for three of the learned bases; the coefficient vector [0, 0, ..., 0.8, ..., 0.3, ..., 0.5, ...] is the feature representation.
(Slide credit: Honglak Lee)

Sparse Coding: Training
• Input: image patches x_1, ..., x_N. Learn a dictionary of bases φ_1, ..., φ_K by minimizing
    Σ_n || x_n − Σ_k a_nk φ_k ||² + λ Σ_n Σ_k |a_nk|
  (reconstruction error + sparsity penalty).
• Alternating optimization (a short code sketch appears after the sparse coding slides below):
  1. Fix the dictionary of bases and solve for the activations a (a standard Lasso problem).
  2. Fix the activations a and optimize the dictionary of bases (a convex QP problem).

Sparse Coding: Testing Time
• Input: a new image patch x* and the K learned bases φ_1, ..., φ_K.
• Output: the sparse representation a of the patch, e.g. x* = 0.8 * φ_i + 0.3 * φ_j + 0.5 * φ_l, so the coefficients [0, 0, ..., 0.8, ..., 0.3, ..., 0.5, ...] are the feature representation.

Image Classification
• Evaluated on the Caltech101 object category dataset (9K images, 101 classes).
• Pipeline: input image → learned bases → features (coefficients) → classification algorithm (SVM).

    Algorithm                          Accuracy
    Baseline (Fei-Fei et al., 2004)    16%
    PCA                                37%
    Sparse Coding                      47%

(Lee, Battle, Raina, Ng, NIPS 2007; slide credit: Honglak Lee)

Interpreting Sparse Coding
• Sparse, over-complete representation a.
• Encoding a = f(x) is an implicit and nonlinear function of x.
• Reconstruction (or decoding) x' = g(a) is linear and explicit.
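The alternating optimization from the training slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the talk: it assumes numpy and scikit-learn are available, uses scikit-learn's Lasso solver for step 1, and replaces the constrained QP of step 2 with an unconstrained least-squares dictionary update followed by renormalizing each basis to unit norm; the function name sparse_coding and parameters K, lam, n_iters are my own.

import numpy as np
from sklearn.linear_model import Lasso

def sparse_coding(X, K=64, lam=0.1, n_iters=20, seed=0):
    """X: (D, N) matrix of image patches, one patch per column.
    Returns dictionary Phi (D, K) and sparse codes A (K, N)."""
    rng = np.random.RandomState(seed)
    D, N = X.shape
    Phi = rng.randn(D, K)
    Phi /= np.linalg.norm(Phi, axis=0, keepdims=True)   # unit-norm bases
    A = np.zeros((K, N))
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=2000)

    for _ in range(n_iters):
        # Step 1: fix the dictionary, solve for the activations (one Lasso problem per patch).
        for n in range(N):
            lasso.fit(Phi, X[:, n])
            A[:, n] = lasso.coef_
        # Step 2: fix the activations, update the dictionary
        # (least squares in place of the constrained QP, then renormalize the bases).
        Phi = np.linalg.lstsq(A.T, X.T, rcond=None)[0].T
        Phi /= np.linalg.norm(Phi, axis=0, keepdims=True) + 1e-12
    return Phi, A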
Autoencoder
• A feed-forward, bottom-up encoder path maps the input image to a feature representation; a feed-back, top-down, generative decoder path maps the features back to the input.
• Details of what goes inside the encoder and decoder matter!
• Need constraints to avoid learning an identity mapping.

Autoencoder (binary features z, linear decoder)
• Encoder filters W with a sigmoid function: z = σ(Wx) for an input image x.
• Decoder filters D with a linear function: reconstruction = Dz.

Autoencoder (binary features z, binary input x)
• Encoder: z = σ(Wx); decoder with a sigmoid function and tied filters: reconstruction = σ(W^T z).
• Need additional constraints to avoid learning an identity.
• Relates to Restricted Boltzmann Machines (later).

Autoencoders (notation follows Hugo Larochelle's "Autoencoders" notes, Université de Sherbrooke, 2012)
• A feed-forward neural network trained to reproduce its input at the output layer; write f(x) ≡ x̂ for its output.
• Encoder: h(x) = g(a(x)) = sigm(b + Wx).
• Decoder: x̂ = o(â(x)) = sigm(c + W* h(x)), with tied weights W* = W^T.

Loss Function
• Loss function for binary inputs:
  Ø cross-entropy error function (reconstruction loss)
    l(f(x)) = − Σ_k ( x_k log(x̂_k) + (1 − x_k) log(1 − x̂_k) ).
• Loss function for real-valued inputs:
  Ø sum of squared differences (reconstruction loss)
    l(f(x)) = ½ Σ_k (x̂_k − x_k)²
  Ø we use a linear activation function at the output.

Gradient Computation (for a training example x^(t))
Forward pass:
    a(x^(t)) = b + W x^(t)
    h(x^(t)) = sigm(a(x^(t)))
    â(x^(t)) = c + W^T h(x^(t))
    x̂^(t)   = sigm(â(x^(t)))
Backward pass:
    ∇_{â(x^(t))} l(f(x^(t))) = x̂^(t) − x^(t)
    ∇_c l(f(x^(t)))          = ∇_{â(x^(t))} l(f(x^(t)))
    ∇_{h(x^(t))} l(f(x^(t))) = W ∇_{â(x^(t))} l(f(x^(t)))
    ∇_{a(x^(t))} l(f(x^(t))) = ∇_{h(x^(t))} l(f(x^(t))) ⊙ [ ..., h(x^(t))_j (1 − h(x^(t))_j), ... ]
    ∇_b l(f(x^(t)))          = ∇_{a(x^(t))} l(f(x^(t)))
    ∇_W l(f(x^(t)))          = ∇_{a(x^(t))} l(f(x^(t))) (x^(t))^T + h(x^(t)) ( ∇_{â(x^(t))} l(f(x^(t))) )^T
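These forward and backward passes map directly to code. Below is a minimal numpy sketch of one gradient evaluation for a tied-weight autoencoder with binary inputs and the cross-entropy loss; it is my own illustration of the equations above (the names sigm and autoencoder_grads are assumptions), not code from the talk.

import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def autoencoder_grads(x, W, b, c):
    """One example x (D,), encoder weights W (K, D), biases b (K,), c (D,).
    Tied weights: the decoder uses W^T. Returns the loss and the gradients."""
    # Forward pass.
    a = b + W @ x                 # a(x) = b + W x
    h = sigm(a)                   # h(x) = sigm(a(x))
    a_hat = c + W.T @ h           # a^(x) = c + W^T h(x)
    x_hat = sigm(a_hat)           # x^ = sigm(a^(x))
    # Cross-entropy reconstruction loss for binary inputs.
    loss = -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))
    # Backward pass, matching the gradient equations above.
    grad_a_hat = x_hat - x                      # dl/d a^(x)
    grad_c = grad_a_hat                         # dl/d c
    grad_h = W @ grad_a_hat                     # dl/d h(x)
    grad_a = grad_h * h * (1.0 - h)             # dl/d a(x)
    grad_b = grad_a                             # dl/d b
    grad_W = np.outer(grad_a, x) + np.outer(h, grad_a_hat)  # both uses of the tied W
    return loss, grad_W, grad_b, grad_c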
Autoencoder: Linear Features
• If the hidden and output layers are linear, the autoencoder (h = Wx, reconstruction = W* h) learns hidden units that are a linear function of the data and minimize the squared reconstruction error.
• The K hidden units will span the same space as the first K principal components. The weight vectors may not be orthogonal.
• With nonlinear hidden units, we have a nonlinear generalization of PCA.

Linear Autoencoder and PCA
• For real-valued inputs, the squared-error loss is the negative log-likelihood of a Gaussian centered at the decoder output µ = c + W* h(x):
    l(f(x)) = − log p(x | µ),  p(x | µ) = (2π)^{−D/2} exp( −½ Σ_k (x_k − µ_k)² ).
• Notation: for a matrix A with SVD A = U Σ V^T, write U_{·,≤k} for the first k columns of U, Σ_{≤k,≤k} for the top-left k×k block of Σ, and V^T_{≤k,·} for the first k rows of V^T. The best rank-k approximation in Frobenius norm is
    B* = argmin_{B : rank(B)=k} || A − B ||_F = U_{·,≤k} Σ_{≤k,≤k} V^T_{≤k,·}.
• Collect the centered training inputs (1/√T)(x^(t) − (1/T) Σ_{t'=1}^T x^(t')) as the columns of X, with SVD X = U Σ V^T. Minimizing the total squared reconstruction error,
    argmin_θ ½ Σ_t Σ_k (x̂_k^(t) − x_k^(t))²,
  is equivalent to
    argmin_{W*, h(X)} || X − W* h(X) ||_F,
  whose solution is W* = U_{·,≤k} Σ_{≤k,≤k} and h(X) = V^T_{≤k,·}.
• The optimal code is itself a linear function of the input:
    h(X) = V^T_{≤k,·}
         = V^T_{≤k,·} (X^T X)^{−1} (X^T X)
         = V^T_{≤k,·} (V Σ^T U^T U Σ V^T)^{−1} (V Σ^T U^T X)
         = V^T_{≤k,·} (V Σ^T Σ V^T)^{−1} V Σ^T U^T X
         = V^T_{≤k,·} V (Σ^T Σ)^{−1} V^T V Σ^T U^T X
         = I_{≤k,·} (Σ^T Σ)^{−1} Σ^T U^T X
         = I_{≤k,·} Σ^{−1} (Σ^T)^{−1} Σ^T U^T X
         = I_{≤k,·} Σ^{−1} U^T X
         = Σ^{−1}_{≤k,≤k} (U_{·,≤k})^T X.
• Hence x̂ = (U_{·,≤k} Σ_{≤k,≤k}) h(x) with h(x) = Σ^{−1}_{≤k,≤k} (U_{·,≤k})^T x: the encoder is the linear map W = Σ^{−1}_{≤k,≤k} (U_{·,≤k})^T, a rescaled projection onto the first k principal directions. (A small numerical check appears at the end of this section.)

Denoising Autoencoder
• Idea: the representation should be robust to the introduction of noise in the input:
  Ø random assignment of a subset of the inputs to 0, with probability ν (similar to dropout on the input layer);
  Ø Gaussian additive noise.
• A corrupted input x̃ is drawn from p(x̃ | x); the reconstruction x̂ = sigm(c + W* h(x̃)) is computed from the corrupted input.
• The loss function compares the reconstruction x̂ with the noiseless input x.
• A related, explicit way to enforce robustness is the contractive penalty on the encoder Jacobian:
    l(f(x^(t))) + λ || ∇_{x^(t)} h(x^(t)) ||²_F,  where || ∇_x h(x) ||²_F = Σ_j Σ_k ( ∂h(x)_j / ∂x_k )².
(Vincent et al., "Extracting and Composing Robust Features with Denoising Autoencoders", ICML 2008)
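A minimal training-step sketch of the denoising autoencoder, under the same assumptions as the numpy sketch above (tied weights, binary inputs, cross-entropy loss); the names denoising_step, nu and lr are my own, not from the slides. The reconstruction is computed from the corrupted input x_tilde, while the loss and gradients are taken against the clean input x.

import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def denoising_step(x, W, b, c, lr=0.1, nu=0.25, rng=np.random):
    """One SGD step of a denoising autoencoder with masking noise of probability nu."""
    x_tilde = x * (rng.rand(x.size) > nu)        # zero out a random subset of the inputs
    # Forward pass from the corrupted input.
    h = sigm(b + W @ x_tilde)
    x_hat = sigm(c + W.T @ h)
    # Cross-entropy reconstruction loss against the *noiseless* input.
    loss = -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))
    # Backward pass: same equations as before, with x_tilde as the encoder input.
    grad_a_hat = x_hat - x                       # dl/d a^(x~)
    grad_a = (W @ grad_a_hat) * h * (1.0 - h)    # dl/d a(x~)
    grad_W = np.outer(grad_a, x_tilde) + np.outer(h, grad_a_hat)
    # Gradient descent update (tied weights, so a single W).
    W -= lr * grad_W
    b -= lr * grad_a
    c -= lr * grad_a_hat
    return loss, W, b, c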
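Finally, a quick numerical check of the linear-autoencoder / PCA derivation above (a sketch assuming numpy; the variable names mirror the equations and are my own): the optimal code h(X) = Σ^{−1}_{≤k,≤k} (U_{·,≤k})^T X equals V^T_{≤k,·}, and W* h(X) is the best rank-k reconstruction of X.

import numpy as np

# Numerical check of the linear-autoencoder / PCA derivation (illustrative sketch).
rng = np.random.RandomState(0)
D, T, k = 10, 200, 3

X = rng.randn(D, T)
X = (X - X.mean(axis=1, keepdims=True)) / np.sqrt(T)   # centered, scaled data matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

W_star = U_k @ S_k                          # optimal decoder W*
h_X = np.linalg.inv(S_k) @ U_k.T @ X        # encoder: h(X) = Sigma_k^{-1} U_k^T X

print(np.allclose(h_X, Vt_k))                         # True: the code equals V^T_{<=k}
print(np.allclose(W_star @ h_X, U_k @ S_k @ Vt_k))    # True: best rank-k reconstruction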