Deep Learning Unsupervised Learning

Deep Learning Unsupervised Learning Russ Salakhutdinov Machine Learning Department Carnegie Mellon University Canadian Institute for Advanced Research Tutorial Roadmap Part 1: Supervised (Discriminave) Learning: Deep Networks Part 2: Unsupervised Learning: Deep Generave Models Part 3: Open Research Ques>ons Unsupervised Learning Non-probabilis>c Models Probabilis>c (Generave) Ø Sparse Coding Models Ø Autoencoders Ø Others (e.g. k-means) Non-Tractable Models Tractable Models Ø Generave Adversarial Ø Fully observed Ø Boltzmann Machines Networks Ø Belief Nets Variaonal Ø Moment Matching Ø NADE Autoencoders Networks Ø PixelRNN Ø Helmholtz Machines Ø Many others… Explicit Density p(x) Implicit Density Tutorial Roadmap • Basic Building Blocks: Ø Sparse Coding Ø Autoencoders • Deep Generave Models Ø Restricted Boltzmann Machines Ø Deep Boltzmann Machines Ø Helmholtz Machines / Variaonal Autoencoders • Generave Adversarial Networks Tutorial Roadmap • Basic Building Blocks: Ø Sparse Coding Ø Autoencoders • Deep Generave Models Ø Restricted Boltzmann Machines Ø Deep Boltzmann Machines Ø Helmholtz Machines / Variaonal Autoencoders • Generave Adversarial Networks Sparse Coding • Sparse coding (Olshausen & Field, 1996). Originally developed to explain early visual processing in the brain (edge detec>on). • Objecve: Given a set of input data vectors learn a dic>onary of bases such that: Sparse: mostly zeros • Each data vector is represented as a sparse linear combinaon of bases. Sparse Coding Natural Images Learned bases: “Edges” New example = 0.8 * + 0.3 * + 0.5 * x = 0.8 * + 0.3 * + 0.5 * [0, 0, … 0.8, …, 0.3, …, 0.5, …] = coefficients (feature representaon) Slide Credit: Honglak Lee Sparse Coding: Training • Input image patches: • Learn dic>onary of bases: Reconstruc>on error Sparsity penalty • Alternang Op>mizaon: 1. Fix dic>onary of bases and solve for ac>vaons a (a standard Lasso problem). 2. Fix ac>vaons a, op>mize the dic>onary of bases (convex QP problem). Sparse Coding: Tes>ng Time • Input: a new image patch x* , and K learned bases • Output: sparse representaon a of an image patch x*. = 0.8 * + 0.3 * + 0.5 * x* = 0.8 * + 0.3 * + 0.5 * [0, 0, … 0.8, …, 0.3, …, 0.5, …] = coefficients (feature representaon) Image Classificaon Evaluated on Caltech101 object category dataset. Classification Algorithm (SVM) Learned Input Image bases Features (coefficients) 9K images, 101 classes Algorithm Accuracy Baseline (Fei-Fei et al., 2004) 16% PCA 37% Sparse Coding 47% Slide Credit: Honglak Lee Lee, Bale, Raina, Ng, 2006 Interpre>ng Sparse Coding a Sparse features a Implicit g(a) Explicit f(x) Linear nonlinear Decoding encoding x’ x • Sparse, over-complete representaon a. • Encoding a = f(x) is implicit and nonlinear func>on of x. • Reconstruc>on (or decoding) x’ = g(a) is linear and explicit. Autoencoder Feature Representation Feed-back, Feed-forward, generative, bottom-up top-down Decoder Encoder path Input Image • Details of what goes insider the encoder and decoder maer! • Need constraints to avoid learning an iden>ty. Autoencoder Binary Features z Decoder Encoder filters D filters W. Dz z=σ(Wx) Linear Sigmoid function function path Input Image x Autoencoder Binary Features z • An autoencoder with D inputs, D outputs, and K hidden units, with K<D. Dz z=σ(Wx) • Given an input x, its reconstruc>on is given by: Input Image x Decoder Encoder Autoencoder Binary Features z • An autoencoder with D inputs, D outputs, and K hidden units, with K<D. Dz z=σ(Wx) Input Image x • We can determine the network parameters W and D by minimizing the reconstruc>on error: Autoencoder Linear Features z • If the hidden and output layers are linear, it will learn hidden units that are a linear func>on of the data and minimize the squared error. Wz z=Wx • The K hidden units will span the same space as the first k principal components. The weight vectors Input Image x may not be orthogonal. • With nonlinear hidden units, we have a nonlinear generalizaon of PCA. Another Autoencoder Model Binary Features z Encoder filters W. σ(WTz) z=σ(Wx) Decoder Sigmoid filters D function path Binary Input x • Need addi>onal constraints to avoid learning an iden>ty. • Relates to Restricted Boltzmann Machines (later). Predic>ve Sparse Decomposi>on Binary Features z L Sparsity Encoder 1 filters W. Dz z=σ(Wx) Decoder Sigmoid filters D function path Real-valued Input x At training time path Decoder Encoder Kavukcuoglu, Ranzato, Fergus, LeCun, 2009 Stacked Autoencoders Class Labels Decoder Encoder Features Sparsity Decoder Encoder Features Sparsity Decoder Encoder Input x Stacked Autoencoders Class Labels Decoder Encoder Features Sparsity Decoder Encoder Features Greedy Layer-wise Learning. Sparsity Decoder Encoder Input x Stacked Autoencoders Class Labels • Remove decoders and use feed-forward part. Encoder • Standard, or Features convolu>onal neural network architecture. Encoder • Parameters can be Features fine-tuned using backpropagaon. Encoder Input x Deep Autoencoders Decoder 30 W 4 Top 500 RBM T T W 1 W 1 + 8 2000 2000 T T 500 W 2 W 2 + 7 1000 1000 W 3 1000 W T W T RBM 3 3 + 6 500 500 W T T 4 W 4 + 5 1000 30 Code layer 30 W 4 W 44+ W 2 2000 500 500 RBM W 3 W 33+ 1000 1000 W 2 W 22+ 2000 2000 2000 W 1 W 1 W 11+ RBM Encoder Pretraining Unrolling Finetuning Deep Autoencoders • 25x25 – 2000 – 1000 – 500 – 30 autoencoder to extract 30-D real- valued codes for Olive face patches. • Top: Random samples from the test dataset. • Middle: Reconstruc>ons by the 30-dimensional deep autoencoder. • Boom: Reconstruc>ons by the 30-dimen>noal PCA. Informaon Retrieval European Community Interbank Markets Monetary/Economic 2-D LSA space Energy Markets Disasters and Accidents Leading Legal/Judicial Economic Indicators Government Accounts/ Borrowings Earnings • The Reuters Corpus Volume II contains 804,414 newswire stories (randomly split into 402,207 training and 402,207 test). • “Bag-of-words” representaon: each ar>cle is represented as a vector containing the counts of the most frequently used 2000 words. (Hinton and Salakhutdinov, Science 2006) Tutorial Roadmap • Basic Building Blocks: Ø Sparse Coding Ø Autoencoders • Deep Generave Models Ø Restricted Boltzmann Machines Ø Deep Boltzmann Machines Ø Helmholtz Machines / Variaonal Autoencoders • Generave Adversarial Networks BRIEF ARTICLE THE AUTHOR Maximum likelihood ✓⇤Fully Observed Models = arg max Ex pdata log pmodel(x ✓) ✓ ⇠ | Fully-visible belief• Explicitly model condi>onal probabili>es: net n pmodel(x)=pmodel(x1) pmodel(xi x1,...,xi 1) | − Yi=2 Each condi>onal can be a complicated neural network • A number of successful models, including Ø NADE, RNADE (Larochelle, et.al. 20011) Ø Pixel CNN (van den Ord et. al. 2016) Ø Pixel RNN (van den Ord et. al. 2016) Pixel CNN 1 Restricted Boltzmann Machines hidden variables Pair-wise Unary Graphical Models: Powerful Feature Detectors framework for represen>ng dependency structure between random variables. Image visible variables RBM is a Markov Random Field with: • Stochas>c binary visible variables • Stochas>c binary hidden variables • Bipar>te connec>ons. Markov random fields, Boltzmann machines, log-linear models. Learning Features Observed Data Learned W: “edges” Subset of 25,000 characters Subset of 1000 features Sparse New Image: representaons = …. Logis>c Func>on: Suitable for modeling binary images RBMs for Real-valued & Count Data Learned features (out of 10,000) 4 million unlabelled images Learned features: ``topics’’ russian clinton computer trade stock Reuters dataset: russia house system country wall 804,414 unlabeled moscow president product import street newswire stories yeltsin bill sovware world point dow Bag-of-Words soviet congress develop economy Collaborave Filtering Binary hidden: user preferences h Learned features: ``genre’’ W1 Fahrenheit 9/11 Independence Day v Bowling for Columbine The Day Aver Tomorrow The People vs. Larry Flynt Con Air Mul>nomial visible: user rangs Canadian Bacon Men in Black II La Dolce Vita Men in Black Nexlix dataset: 480,189 users Friday the 13th Scary Movie The Texas Chainsaw Massacre Naked Gun 17,770 movies Children of the Corn Hot Shots! Over 100 million rangs Child's Play American Pie The Return of Michael Myers Police Academy State-of-the-art performance on the Nexlix dataset. (Salakhutdinov, Mnih, Hinton, ICML 2007) Different Data Modali>es • Binary/Gaussian/ Sovmax RBMs: All have binary hidden variables but use them to model different kinds of data. Binary 0 0 0 1 0 Real-valued 1-of-K • It is easy to infer the states of the hidden variables: Product of Experts The joint distribu>on is given by: Marginalizing over hidden variables: Product of Experts government clinton bribery mafia stock … authority house corrupon business wall power president dishonesty gang street empire bill corrupt mob point federaon congress fraud insider dow Topics “government”, ”corrupon” and ”mafia” can combine to give very high probability to a word “Silvio Silvio Berlusconi Berlusconi”. Product of Experts The joint distribu>on is given by: Marginalizing over hidden variables: Product of Experts 50 Replicated 40 Softmax 50−D government clinton bribery mafia stock … authority house 30 corrupon business wall street power president dishonesty LDA 50−D gang point empire bill 20 corrupt mob federaon congress fraud insider dow Precision (%) 10 Topics “government”, ”corrupon” and ”mafia” can combine to give very 0.001 0.006 0.051 0.4 1.6 6.4 25.6 100 Recall high probability to a word “(%) Silvio Silvio Berlusconi Berlusconi”. Deep Boltzmann Machines Low-level features: Edges Built from unlabeled inputs. Input: Pixels Image (Salakhutdinov 2008, Salakhutdinov & Hinton 2012) Deep Boltzmann Machines Learn simpler representaons, then compose more complex ones Higher-level features: Combinaon of edges Low-level features: Edges Built from unlabeled inputs. Input: Pixels Image (Salakhutdinov 2008, Salakhutdinov & Hinton 2009) Model Formulaon h3 Same as RBMs W3 model parameters h2 • Dependencies between hidden variables.

Load more