Deep Learning II Unsupervised Learning

Deep Learning II Unsupervised Learning Russ Salakhutdinov Machine Learning Department Carnegie Mellon University Canadian Institute of Advanced Research Talk Roadmap Part 1: Supervised Learning: Deep Networks Part 2: Unsupervised Learning: Learning Deep Generave Models Part 3: Open Research QuesAons Unsupervised Learning Non-probabilisAc Models ProbabilisAc (Generave) Ø Sparse Coding Models Ø Autoencoders Ø Others (e.g. k-means) Non-Tractable Models Tractable Models Ø Generave Adversarial Ø Fully observed Ø Boltzmann Machines Networks Ø Belief Nets Variaonal Ø Moment Matching Ø NADE Autoencoders Networks Ø PixelRNN Ø Helmholtz Machines Ø Many others… Explicit Density p(x) Implicit Density Talk Roadmap • Basic Building Blocks: Ø Sparse Coding Ø Autoencoders • Deep Generave Models Ø Restricted Boltzmann Machines Ø Deep Belief Networks and Deep Boltzmann Machines Ø Helmholtz Machines / Variaonal Autoencoders • Generave Adversarial Networks • Model Evaluaon Sparse Coding • Sparse coding (Olshausen & Field, 1996). Originally developed to explain early visual processing in the brain (edge detecAon). • Objecve: Given a set of input data vectors learn a dicAonary of bases such that: Sparse: mostly zeros • Each data vector is represented as a sparse linear combinaon of bases. Sparse Coding Natural Images Learned bases: “Edges” New example = 0.8 * + 0.3 * + 0.5 * x = 0.8 * + 0.3 * + 0.5 * [0, 0, … 0.8, …, 0.3, …, 0.5, …] = coefficients (feature representaon) Slide Credit: Honglak Lee Sparse Coding: Training • Input image patches: • Learn dicAonary of bases: ReconstrucAon error Sparsity penalty • Alternang OpAmizaon: 1. Fix dicAonary of bases and solve for acAvaons a (a standard Lasso problem). 2. Fix acAvaons a, opAmize the dicAonary of bases (convex QP problem). Sparse Coding: TesAng Time • Input: a new image patch x* , and K learned bases • Output: sparse representaon a of an image patch x*. = 0.8 * + 0.3 * + 0.5 * x* = 0.8 * + 0.3 * + 0.5 * [0, 0, … 0.8, …, 0.3, …, 0.5, …] = coefficients (feature representaon) Image Classificaon Evaluated on Caltech101 object category dataset. Classification Algorithm (SVM) Learned Input Image bases Features (coefficients) 9K images, 101 classes Algorithm Accuracy Baseline (Fei-Fei et al., 2004) 16% PCA 37% Sparse Coding 47% Slide Credit: Honglak Lee (Lee, Battle, Raina, Ng, NIPS 2007) InterpreAng Sparse Coding a Sparse features a Implicit g(a) Explicit f(x) Linear nonlinear Decoding encoding x’ x • Sparse, over-complete representaon a. • Encoding a = f(x) is implicit and nonlinear funcAon of x. • ReconstrucAon (or decoding) x’ = g(a) is linear and explicit. Autoencoder Feature Representation Feed-back, Feed-forward, generative, bottom-up top-down Decoder Encoder path Input Image • Details of what goes insider the encoder and decoder maer! • Need constraints to avoid learning an idenAty. Autoencoder Binary Features z Decoder Encoder filters D filters W. Dz z=σ(Wx) Linear Sigmoid function function path Input Image x Autoencoder Binary Features z Encoder filters W. σ(WTz) z=σ(Wx) Decoder Sigmoid filters D function path Binary Input x • Need addiAonal constraints to avoid learning an idenAty. • Relates to Restricted Boltzmann Machines (later). Autoencoders Hugo Larochelle Departement´ d’informatique Universite´ de Sherbrooke [email protected] October 16, 2012 Abstract Math for my slides “Autoencoders”. Autoencoders • Hugo Larochelle AutoencodersDepartement´ h(x)= d’informatique g(a(x)) Universite´ de Sherbrooke = sigm(b + Wx) • Feed-forward neural network trained to reproduce its input at the output [email protected] layer • Decoder October 16, 2012 x = o(a(x)) = sigm(c + W h(x)) Abstract ⇤ b b f(Mathx) forx myl(f slides(x)) “Autoencoders”. = (x x )2 l(f(x)) = For binary units (x log(x )+(1 x ) log(1 x )) • ⌘ k k − k − k k k − k − k P P • b b Encoder b b h(x)=g(a(x)) = sigm(b + Wx) • x = o(a(x)) = sigm(c + W⇤h(x)) b b f(x) x l(f(x)) = (x x )2 l(f(x)) = (x log(x )+(1 x ) log(1 x )) • ⌘ k k − k − k k k − k − k P P b b b b 1 1 Autoencoders Autoencoders Hugo Larochelle Hugo Larochelle Autoencoders Departement´ d’informatique Departement´ d’informatique Universite´ de Sherbrooke Universite´ de Sherbrooke Hugo Larochelle [email protected] [email protected]´ d’informatique October 16, 2012 Universite´ de Sherbrooke October 16, 2012 [email protected] Abstract October 16, 2012 Abstract Math for my slides “Autoencoders”. Math for my slides “Autoencoders”. • •Abstract h(x)=g(b + Wx) Math for my slides “Autoencoders”. h(x)=g(b + Wx) = sigm(b + Wx) = sigm(b + Wx) • • h(x)=g(a(x)) x = o(cLoss+ W⇤h (Functionx)) • = sigm(b + Wx) = sigm(c + W⇤h(x)) x = o(c + W h(x)) • Loss function for binary inputs ⇤ b = sigm(c + W⇤h(x)) 2 f(x) x l(f(x)) = (xk xk) l(f(x)) = (xk log(xk)+(1 xk) log(1 xk)) • ⌘ k• − − k − − b 2 P Ø Cross-entropyP error function (reconstruction loss) f(x) x l(f(x)) = (x x ) b b b x = • bo(a(x⌘)) k k − k = sigm(c + W⇤h(x)) P b b • Loss function for real-valued inputs b b f(x) x l(f(x)) = 1 (x x )2 l(f(x)) = (x log(x )+(1 x ) log(1 x )) • ⌘ 2 k k − k − k k k − k − k Ø (t) (t) (t) (t) l(f(xsum)) of =squaredx P differencesx (reconstruction loss) P a(x )b b b b • r Ø we use a linear activation− function at the output b a(x(t)) = b + Wx(t) b ( h(x(t)) = sigm(a(x(t))) (t) ( (t) a(x ) = c + W>h(x ) ( x(t) = sigm(a(x(t))) b ( (t) (t) b (t) b (t) l(f(x )) = x x ra(x ) ( − (t) (t) b cl(f(x )) = a(x(t))l(f(x )) r ( rb (t) (t) (t) l(f(x )) = W (t) l(f(x )) rh(x ) ( b ra(x ) (t) ⇣ (t) ⌘ (t) (t) a(x(t))l(f(x )) = h(x(bt))l(f(x )) [...,h(x )j(1 h(x )j),...] r 1 ( r − (t) ⇣ (t) ⌘ l(f(x )) = (t) l(f(x )) rb ( ra(x ) (t) (t) (t) (t) (t) > l(f(x )) = (t) l(f(x )) x > + h(x ) (t) l(f(x )) rW ( ra(x ) ra(x ) ⇣ ⌘ ⇣ ⌘ b 1 W⇤ = W> • 1 Autoencoder Linear Features z • If the hidden and output layers are linear, it will learn hidden units that are a linear funcAon of the data and minimize the squared error. Wz z=Wx • The K hidden units will span the same space as the first k principal components. The weight vectors Input Image x may not be orthogonal. • With nonlinear hidden units, we have a nonlinear generalizaon of PCA. l(f(x(t))) W • rW p(x µ) µ • | µ h(x) • l(fl((fx((xt))))) =W log p(x µ) • rW• − | p(x µ)= 1 exp( 1 (x µ )2) µ = c + W h(x) p(x µ) µ (2⇡)D/2 2 k k k ⇤ • |• | − − P µ h(xAA) = U ⌃ V> U , k ⌃ k, k V>, k B • • · · l(f(x)) = log p(x µ) • • − | B⇤ = arg min A B F 1 1 2 p(x µ)= exp( (x µ ) ) µ = c + WB⇤hs(.tx. )rank(B)=k || − || • | (2⇡)D/2 − 2 k k − k B⇤ = U , k ⌃ k,Pk V>, k AA= U ⌃ V>· U ,k ⌃ k, · kV>, k B • · · 1 (t) (t) 2 arg min (xk xk ) arg min X W⇤h(X) F • ✓ 2 − ≥ W ,h(X) || − || Bt ⇤ =k arg min A B F⇤ X XB s.t. rank(B)=k || − || b (t) B⇤ =xU ,Xk =⌃ Uk,⌃kVV>>, k • · · arg min X 1 W⇤h(tX) ) F(t) = W⇤ U , k ⌃ k, k, h(X) V>, k 2 · Warg⇤,h min(X) || − (xk ||xk ) arg min X W⇤h(X) F · ✓ 2 − ≥ || − || t k W⇤,h(X) X X h(X)=V>, k b · 1 (t) = V>, k (X> X)− (X> X) x X = U ⌃ V> · • 1 = V>, k (V ⌃> U> U ⌃ V>)− (V ⌃> U> X) · arg min X W⇤h(X) F = W⇤ U , k ⌃ k, k, h(X)1 V>, k || − || · · W⇤,h(X) = V>, k (V ⌃> ⌃ V>)− V ⌃> U> X · 1 h(X)=V>,=k V>, k V (⌃> ⌃)− V> V ⌃> U> X · · 1 1 = V>,=k (XV>> X)V− (⌃(X>>⌃X)−) ⌃> U> X · , k · 1 = V>, k (V ⌃> U> U ⌃1 V>)− (V ⌃> U> X) · =I k, (⌃> ⌃)− ⌃> U> X · 1 = V> (V ⌃> ⌃ 1V>)− V1 ⌃> U> X ,=Ik k, ⌃− (⌃>)− ⌃> U> X · · 1 1 = V>,=Ik V (k,⌃>⌃⌃−)−UV> >XV ⌃> U> X · · 1 1 = V>,=k V⌃−(⌃> ⌃()U− ,⌃k>)>UX> X · k, k · 1 =Ik, (⌃> ⌃)− ⌃> U> X · 1 1 =Ik, ⌃− (⌃>)− ⌃> U> X 1 · Denoisingx =(U Autoencoder, k ⌃ k, k) h(x ) h(x)= ⌃−k, k (U , k)1> xW⇤ W • · =I k, · ⌃− U> X · ⇣ 1 ⌘ • Idea: representation (shouldt) be1 robust (tot) introduction1 T of noise:(t0 ) = ⌃−k, k (U , k)> X xb x x · • pT − T t0=1 Ø random assignment of subset of ⇣ ⌘ inputs to 0, with pprobability(x x) ⌫ x P 1 Ø Similar to dropouts• on| the input layer x =(U , k ⌃ k, k) h(x) h(x)= ⌃−k, k (U , k)> xW⇤ W • x· = sigm( c + W⇤h(x)) · Ø Gaussian additive• noisee e ⇣ ⌘ (t) 1 (t)(t) 1 T (t(0t)) 2 xb l(f(x x)) + λ x(t) hx(x ) • • pT − T||r t0=1 ||F • Reconstruction x computedb e ⇣ P ⌘ from the corruptedp(x x )input⌫ x 2 • |• (t) (t) 2 @h(x )j • Loss function compares x x(t) h(x ) = e e ||r ||F (t) reconstruction with the noiseless j k @xk ! input x X X (Vincent et al., ICML 2008) l(f(x(t))) = x(t) log(x(t))+(1 x(t)) log(1 x(t)) • − k k k − k − k P ⇣ ⌘ b b 2 2 (t) Wl(f(xExtracting)) W and Composing Robust Features with Denoising Autoencoders • r p(X X)=p(x µq )(Xµ X).

Deep Learning II Unsupervised Learning

Deep Belief Networks for Phone Recognition

A Survey Paper on Deep Belief Network for Big Data

Auto-Encoding a Knowledge Graph Using a Deep Belief Network

Unsupervised Pre-Training of a Deep LSTM-Based Stacked Autoencoder for Multivariate Time Series Forecasting Problems Alaa Sagheer 1,2,3* & Mostafa Kotb2,3

A Video Recognition Method by Using Adaptive Structural Learning Of

Deep Reinforcement Learning with Experience Replay Based on SARSA

Deep Belief Networks Based Feature Generation and Regression For

Deep Belief Networks

Non-Destructive Detection of Tea Leaf Chlorophyll Content Using Hyperspectral Reﬂectance and Machine Learning Algorithms

A Deep Belief Network Classification Approach for Automatic

The Effects of Deep Belief Network Pre-Training of a Multilayered Perceptron Under Varied Labeled Data Conditions

A Deep Belief Network Approach to Learning Depth from Optical Flow