Deep Learning II: Unsupervised Learning

Russ Salakhutdinov
Machine Learning Department, Carnegie Mellon University
Canadian Institute for Advanced Research

Talk Roadmap
• Part 1: Supervised Learning: Deep Networks
• Part 2: Unsupervised Learning: Learning Deep Generative Models
• Part 3: Open Research Questions

Unsupervised Learning
• Non-probabilistic Models
  Ø Sparse Coding
  Ø Autoencoders
  Ø Others (e.g. k-means)
• Probabilistic (Generative) Models
  Ø Tractable Models (explicit density p(x)): Fully observed Belief Nets, NADE, PixelRNN
  Ø Non-Tractable Models (explicit density p(x)): Boltzmann Machines, Variational Autoencoders, Helmholtz Machines, many others…
  Ø Implicit Density: Generative Adversarial Networks, Moment Matching Networks

Talk Roadmap
• Basic Building Blocks:
  Ø Sparse Coding
  Ø Autoencoders
• Deep Generative Models
  Ø Restricted Boltzmann Machines
  Ø Deep Belief Networks and Deep Boltzmann Machines
  Ø Helmholtz Machines / Variational Autoencoders
• Generative Adversarial Networks
• Model Evaluation

Sparse Coding
• Sparse coding (Olshausen & Field, 1996) was originally developed to explain early visual processing in the brain (edge detection).
• Objective: given a set of input data vectors x^(1), …, x^(N), learn a dictionary of bases φ_1, …, φ_K such that each data vector is represented as a sparse linear combination of the bases:

    x^(n) ≈ Σ_k a_k^(n) φ_k,   with the coefficient vector a^(n) sparse (mostly zeros).

Sparse Coding: Example
• Trained on natural images, the learned bases look like "edges".
• A new example patch x is written as a sparse combination of a few learned bases, e.g. with weights 0.8, 0.3 and 0.5; the coefficient vector [0, 0, …, 0.8, …, 0.3, …, 0.5, …] is the feature representation. (Slide credit: Honglak Lee)

Sparse Coding: Training
• Input: image patches x^(1), …, x^(N).
• Learn a dictionary of bases φ_1, …, φ_K by minimizing

    Σ_n ‖ x^(n) − Σ_k a_k^(n) φ_k ‖² + λ Σ_n Σ_k | a_k^(n) |

  where the first term is the reconstruction error and the second is the sparsity penalty.
• Alternating optimization (see the NumPy sketch at the end of this sparse-coding section):
  1. Fix the dictionary of bases and solve for the activations a (a standard Lasso problem).
  2. Fix the activations a and optimize the dictionary of bases (a convex QP problem).

Sparse Coding: Testing Time
• Input: a new image patch x* and the K learned bases.
• Output: the sparse representation a of x*, obtained by solving the same Lasso problem with the bases held fixed; the resulting coefficient vector (e.g. [0, 0, …, 0.8, …, 0.3, …, 0.5, …]) is the feature representation.

Image Classification
• Pipeline: input image → learned bases → features (coefficients) → classification algorithm (SVM).
• Evaluated on the Caltech101 object category dataset (9K images, 101 classes):

  Algorithm                          Accuracy
  Baseline (Fei-Fei et al., 2004)    16%
  PCA                                37%
  Sparse Coding                      47%

  (Slide credit: Honglak Lee; Lee, Battle, Raina, Ng, NIPS 2007)

Interpreting Sparse Coding
• Sparse, over-complete representation a.
• Encoding a = f(x) is an implicit and nonlinear function of x.
• Reconstruction (or decoding) x' = g(a) is linear and explicit.
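As a concrete illustration of the alternating scheme above, here is a minimal NumPy sketch (not the implementation from the cited work): the codes are updated with ISTA, a standard iterative solver for the Lasso subproblem, while the dictionary is updated with a projected gradient step onto unit-norm bases rather than the QP solver mentioned on the slide. The function names, step sizes, and iteration counts are arbitrary choices made for this sketch.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_coding(X, k=64, lam=0.1, n_outer=50, n_ista=30, seed=0):
    """Alternating minimization of 0.5 * ||X - D A||_F^2 + lam * ||A||_1.

    X: (d, n) data matrix (columns are patches), D: (d, k) dictionary of bases,
    A: (k, n) sparse codes. Illustrative sketch only.
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    D = rng.standard_normal((d, k))
    D /= np.linalg.norm(D, axis=0, keepdims=True)      # unit-norm bases
    A = np.zeros((k, n))

    for _ in range(n_outer):
        # Step 1: fix D, solve the Lasso for the codes A with ISTA.
        step = 1.0 / (np.linalg.norm(D, 2) ** 2)        # 1 / Lipschitz constant
        for _ in range(n_ista):
            A = soft_threshold(A - step * D.T @ (D @ A - X), step * lam)

        # Step 2: fix A, take a gradient step on D and re-normalize its columns
        # (projected gradient onto unit-norm bases; the slides use a QP here).
        G = (D @ A - X) @ A.T
        D -= G / (np.linalg.norm(A, 2) ** 2 + 1e-8)
        D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-8

    return D, A
```

At test time, only Step 1 is run for the new patch, with the learned dictionary held fixed.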
Autoencoder
• Encoder: a feed-forward, bottom-up path from the input image to the feature representation.
• Decoder: a feed-back, generative, top-down path from the feature representation back to the input.
• Details of what goes inside the encoder and decoder matter!
• Need constraints to avoid learning an identity mapping.

Autoencoder: Binary Features
• Encoder (filters W, sigmoid nonlinearity): z = σ(Wx).
• Decoder with a linear output (filters D): x̂ = Dz.
• Alternatively, for binary inputs, a sigmoid decoder with tied weights: x̂ = σ(Wᵀz).
• Additional constraints are still needed to avoid learning an identity mapping.
• Relates to Restricted Boltzmann Machines (later).

Autoencoders (notation from Hugo Larochelle's autoencoder lecture notes)
• A feed-forward neural network trained to reproduce its input at the output layer.
• Encoder: h(x) = g(a(x)) = sigm(b + Wx).
• Decoder: x̂ = o(â(x)) = sigm(c + W* h(x)), usually with tied weights W* = Wᵀ.

Loss Function
• For binary inputs, the cross-entropy error function (reconstruction loss), with x̂ = sigm(c + W* h(x)):

    l(f(x)) = − Σ_k [ x_k log x̂_k + (1 − x_k) log(1 − x̂_k) ]

• For real-valued inputs, the sum of squared differences (reconstruction loss), with a linear activation function at the output, x̂ = c + W* h(x):

    l(f(x)) = ½ Σ_k ( x̂_k − x_k )²

  This is the negative log-likelihood l(f(x)) = − log p(x | µ) of a unit-variance Gaussian with mean µ = c + W* h(x):

    p(x | µ) = (2π)^(−D/2) exp( −½ Σ_k (x_k − µ_k)² )

Gradient Computation
• For one training example x^(t), with tied weights W* = Wᵀ and writing l^(t) for l(f(x^(t))), the forward pass is

    a(x^(t)) = b + W x^(t)
    h(x^(t)) = sigm(a(x^(t)))
    â(x^(t)) = c + Wᵀ h(x^(t))
    x̂^(t)   = sigm(â(x^(t)))

  and backpropagation gives

    ∇_{â(x^(t))} l^(t) = x̂^(t) − x^(t)
    ∇_c l^(t)          = ∇_{â(x^(t))} l^(t)
    ∇_{h(x^(t))} l^(t) = W ∇_{â(x^(t))} l^(t)
    ∇_{a(x^(t))} l^(t) = ∇_{h(x^(t))} l^(t) ⊙ [ …, h(x^(t))_j (1 − h(x^(t))_j), … ]
    ∇_b l^(t)          = ∇_{a(x^(t))} l^(t)
    ∇_W l^(t)          = ( ∇_{a(x^(t))} l^(t) ) (x^(t))ᵀ + h(x^(t)) ( ∇_{â(x^(t))} l^(t) )ᵀ
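These updates translate almost line-for-line into NumPy. The sketch below assumes a tied-weight sigmoid autoencoder trained on binary input vectors with the cross-entropy loss; the function name and learning rate are arbitrary choices for this illustration.

```python
import numpy as np

def sigm(a):
    """Elementwise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

def autoencoder_sgd_step(x, W, b, c, lr=0.1):
    """One SGD step for a tied-weight sigmoid autoencoder on a binary input x.

    Encoder: h = sigm(b + W x); decoder: x_hat = sigm(c + W.T h);
    loss: cross-entropy between x and x_hat. Follows the backprop
    equations above. Illustrative sketch only.
    """
    # Forward pass
    a = b + W @ x                 # hidden pre-activation a(x)
    h = sigm(a)                   # hidden code h(x)
    a_hat = c + W.T @ h           # output pre-activation a_hat(x)
    x_hat = sigm(a_hat)           # reconstruction

    # Backward pass
    grad_a_hat = x_hat - x                      # gradient w.r.t. a_hat(x)
    grad_c = grad_a_hat
    grad_h = W @ grad_a_hat                     # gradient w.r.t. h(x)
    grad_a = grad_h * h * (1.0 - h)             # gradient w.r.t. a(x)
    grad_b = grad_a
    # Tied weights: W receives gradient from both the encoder and decoder paths.
    grad_W = np.outer(grad_a, x) + np.outer(h, grad_a_hat)

    # Gradient-descent update
    W -= lr * grad_W
    b -= lr * grad_b
    c -= lr * grad_c

    eps = 1e-12  # numerical safety for the logs
    loss = -np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))
    return W, b, c, loss
```

Iterating this step over shuffled training examples is plain stochastic gradient descent on the reconstruction loss.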
Autoencoder: Linear Features
• If the hidden and output layers are linear, the autoencoder learns hidden units that are a linear function of the data and minimize the squared error.
• The K hidden units will span the same space as the first K principal components of the data; the weight vectors, however, may not be orthogonal.
• With nonlinear hidden units, we obtain a nonlinear generalization of PCA.

Why the linear autoencoder recovers the PCA subspace:
• Let A = U Σ Vᵀ be the SVD of A. The best rank-k approximation in Frobenius norm,

    B* = argmin_{B : rank(B) = k} ‖ A − B ‖_F,   is   B* = U_{·,≤k} Σ_{≤k,≤k} (V_{·,≤k})ᵀ.

• Training the linear autoencoder,

    argmin_θ ½ Σ_t Σ_k ( x̂_k^(t) − x_k^(t) )²   ≡   argmin_{W*, h(X)} ‖ X − W* h(X) ‖_F,

  where the columns of X are the training inputs x^(t) and W* h(X) has rank at most k.
• With X = U Σ Vᵀ, an optimal solution is therefore W* = U_{·,≤k} Σ_{≤k,≤k} and h(X) = (V_{·,≤k})ᵀ.
• The optimal code is in fact a linear function of the input (I_{≤k,·} denotes the first k rows of the identity, and Σ is treated as invertible):

    h(X) = (V_{·,≤k})ᵀ
         = (V_{·,≤k})ᵀ (Xᵀ X)⁻¹ (Xᵀ X)
         = (V_{·,≤k})ᵀ (V Σᵀ Uᵀ U Σ Vᵀ)⁻¹ (V Σᵀ Uᵀ X)
         = (V_{·,≤k})ᵀ (V Σᵀ Σ Vᵀ)⁻¹ V Σᵀ Uᵀ X
         = (V_{·,≤k})ᵀ V (Σᵀ Σ)⁻¹ Vᵀ V Σᵀ Uᵀ X
         = I_{≤k,·} (Σᵀ Σ)⁻¹ Σᵀ Uᵀ X
         = I_{≤k,·} Σ⁻¹ (Σᵀ)⁻¹ Σᵀ Uᵀ X
         = I_{≤k,·} Σ⁻¹ Uᵀ X
         = Σ⁻¹_{≤k,≤k} (U_{·,≤k})ᵀ X

• So x̂ = ( U_{·,≤k} Σ_{≤k,≤k} ) h(x) with h(x) = Σ⁻¹_{≤k,≤k} (U_{·,≤k})ᵀ x: the encoder projects onto the top-k left singular directions (up to the Σ⁻¹ scaling).
• If the inputs are first centered and scaled, x^(t) ← (1/√T) ( x^(t) − (1/T) Σ_{t'} x^(t') ), these directions are exactly the principal components.

Denoising Autoencoder
• Idea: the representation should be robust to the introduction of noise in the input:
  Ø random assignment of a subset of the inputs to 0, with probability ν (masking noise);
  Ø similar to dropout applied to the input layer;
  Ø Gaussian additive noise.
• The corrupted input x̃ is drawn from p(x̃ | x), and the reconstruction is computed from it: x̂ = sigm(c + W* h(x̃)).
• The loss function compares the reconstruction x̂ with the noiseless input x.
  (Vincent et al., "Extracting and Composing Robust Features with Denoising Autoencoders", ICML 2008)
• A related regularizer (the contractive penalty) makes the code insensitive to the input directly, by minimizing

    l(f(x^(t))) + λ ‖ ∇_{x^(t)} h(x^(t)) ‖²_F,   where   ‖ ∇_{x^(t)} h(x^(t)) ‖²_F = Σ_j Σ_k ( ∂h(x^(t))_j / ∂x_k )².
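For completeness, here is how the denoising variant changes the single-example update from the earlier sketch: the forward pass starts from a corrupted copy of the input, while the loss and its gradients are still measured against the clean input. The masking probability ν, the function name, and the learning rate are again arbitrary choices for this sketch.

```python
import numpy as np

def sigm(a):
    """Elementwise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

def denoising_ae_step(x, W, b, c, nu=0.25, lr=0.1, rng=None):
    """One SGD step of a tied-weight denoising autoencoder with masking noise.

    Each input component is zeroed with probability nu; the hidden code is
    computed from the corrupted input x_tilde, but the cross-entropy loss
    compares the reconstruction against the *clean* input x. Sketch only.
    """
    rng = rng or np.random.default_rng()

    # Corruption: randomly set a subset of the inputs to 0 with probability nu.
    mask = rng.random(x.shape) >= nu
    x_tilde = x * mask

    # Forward pass from the corrupted input.
    h = sigm(b + W @ x_tilde)
    x_hat = sigm(c + W.T @ h)

    # Backward pass: gradients of the clean-input cross-entropy loss.
    grad_a_hat = x_hat - x
    grad_a = (W @ grad_a_hat) * h * (1.0 - h)
    W -= lr * (np.outer(grad_a, x_tilde) + np.outer(h, grad_a_hat))
    b -= lr * grad_a
    c -= lr * grad_a_hat

    eps = 1e-12
    loss = -np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))
    return W, b, c, loss
```

With nu = 0 this reduces to the plain autoencoder step shown earlier.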
