COMP 551 – Applied Machine Learning, Lecture 16

Instructor: Joelle Pineau ([email protected])

Class web page: www.cs.mcgill.ca/~jpineau/comp551

Unless otherwise noted, all material posted for this course is copyright of the instructor, and cannot be reused or reposted without the instructor’s written permission.

The deep learning objective

Figure 1: We would like the raw input image to be transformed into gradually higher levels of representation, representing more and more abstract functions of the raw input, e.g., edges, local shapes, object parts, etc. In practice, we do not know in advance what the “right” representation should be for all these levels of abstractions, although linguistic concepts might help guessing what the higher levels should implicitly represent.

Learning an autoencoder function

• Goal: Learn a compressed representation of the input data.

• We have two functions:

– Encoder: h = f_W(x) = s_f(Wx)

– Decoder: x’ = g_{W’}(h) = s_g(W’h), where s(·) can be a sigmoid, linear, or other function and W, W’ are weight matrices.


• To train, minimize the reconstruction error:

Err(W, W’) = ∑_{i=1:n} L[ x_i , g_{W’}(f_W(x_i)) ], using squared-error loss (continuous inputs) or cross-entropy (binary inputs).
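To make the two functions and the training criterion concrete, here is a minimal NumPy sketch (an illustration for these notes, not course code) of a sigmoid encoder, a sigmoid decoder, and both reconstruction losses:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d, k = 100, 20, 5                      # examples, input dim, code dim
X = rng.random((n, d))                    # toy inputs in [0, 1]

W = rng.normal(scale=0.1, size=(k, d))    # encoder weights W
Wp = rng.normal(scale=0.1, size=(d, k))   # decoder weights W'

H = sigmoid(X @ W.T)                      # h  = f_W(x)  = s_f(W x)
Xr = sigmoid(H @ Wp.T)                    # x' = g_{W'}(h) = s_g(W' h)

# Reconstruction error summed over the n examples
squared_error = np.sum((X - Xr) ** 2)                                # continuous inputs
cross_entropy = -np.sum(X * np.log(Xr) + (1 - X) * np.log(1 - Xr))   # binary inputs
print(squared_error, cross_entropy)
```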

PCA vs. autoencoders

In the case of linear functions f_W(x) = Wx and g_{W’}(h) = W’h, with squared-error loss:

Err(W, W’) = ∑_{i=1:n} || x_i – g_{W’}(f_W(x_i)) ||²

we can show that the minimum-error solution W yields the same subspace as PCA.

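A quick numerical check of this claim, as an illustrative sketch (toy data and a plain gradient-descent loop, not from the slides): train a rank-k linear autoencoder on centred data and compare the decoder's column space with the top-k PCA subspace.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 500, 10, 3
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d)) + 0.01 * rng.normal(size=(n, d))
X = X - X.mean(axis=0)                       # centre the data

# PCA: projector onto the span of the top-k right singular vectors.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P_pca = Vt[:k].T @ Vt[:k]

# Linear autoencoder f_W(x) = Wx, g_{W'}(h) = W'h, trained on squared error.
W = rng.normal(scale=0.1, size=(k, d))       # encoder
Wp = rng.normal(scale=0.1, size=(d, k))      # decoder
lr = 0.02
for _ in range(5000):
    H = X @ W.T                              # codes
    G = H @ Wp.T - X                         # reconstruction residual x' - x
    gWp = G.T @ H / n                        # gradient w.r.t. decoder weights
    gW = (G @ Wp).T @ X / n                  # gradient w.r.t. encoder weights
    Wp -= lr * gWp
    W -= lr * gW

# Projector onto the decoder's column space; it should match the PCA projector.
Q, _ = np.linalg.qr(Wp)
P_ae = Q @ Q.T
print("subspace difference:", np.linalg.norm(P_ae - P_pca))  # small when converged
```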

Stacked autoencoders

Key idea: Apply greedy layerwise unsupervised pre-training.

http://www.dmi.usherb.ca/~larocheh/projects_deep_learning.html
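A rough sketch of the greedy layerwise procedure (my own illustration, assuming tied weights, sigmoid units, and squared-error loss): train one autoencoder on the raw input, then train the next one on the resulting codes, and so on.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, k, lr=0.1, epochs=200, seed=0):
    """Train one tied-weight sigmoid autoencoder on X; return weights and codes."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=0.1, size=(k, d))
    b, c = np.zeros(k), np.zeros(d)
    for _ in range(epochs):
        H = sigmoid(X @ W.T + b)              # encode
        Xr = sigmoid(H @ W + c)               # decode with tied weights (W' = W^T)
        G_out = (Xr - X) * Xr * (1 - Xr)      # d(squared error)/d(decoder pre-activation)
        G_hid = (G_out @ W.T) * H * (1 - H)   # backpropagated to the encoder pre-activation
        W -= lr * (G_hid.T @ X + H.T @ G_out) / n   # tied weights get both contributions
        b -= lr * G_hid.mean(axis=0)
        c -= lr * G_out.mean(axis=0)
    return W, b, sigmoid(X @ W.T + b)

rng = np.random.default_rng(1)
X = rng.random((200, 30))                     # toy raw input

# Greedy layerwise pretraining: each layer is trained on the previous layer's codes.
layers, reps = [], X
for k in [20, 10, 5]:
    W, b, reps = train_autoencoder(reps, k)
    layers.append((W, b))
print(reps.shape)                             # top-level representation: (200, 5)
```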

Regularization of autoencoders

• How can we generate sparse autoencoders? (And also, why?)

• Weight tying of the encoder and decoder weights (W = W’) to explicitly constrain (regularize) the learned function.

• Directly penalize the output of the hidden units (e.g. with an L1 penalty) to introduce sparsity in the learned representation.

• Penalize the average output (over a batch of data) to encourage it to approach a fixed target.
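Both penalties are easy to write down. The sketch below (illustrative, not course code) computes an L1 penalty on the hidden activations, and a penalty that pushes the average activation of each unit toward a fixed target ρ, here expressed as a KL divergence, which is one common choice (the slide only says "a fixed target").

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((64, 20))                      # one batch of inputs
W = rng.normal(scale=0.1, size=(8, 20))       # encoder weights
H = sigmoid(X @ W.T)                          # hidden activations, shape (64, 8)

# (a) L1 penalty on the hidden-unit outputs -> encourages sparse activations.
l1_penalty = np.sum(np.abs(H))

# (b) Push the average activation of each unit (over the batch) toward a fixed
#     target rho; a KL divergence is one common way to express this penalty.
rho = 0.05
rho_hat = H.mean(axis=0)
kl_penalty = np.sum(rho * np.log(rho / rho_hat)
                    + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

# Either penalty is added to the reconstruction error with a weight lambda.
print(l1_penalty, kl_penalty)
```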

Extracting and Composing Robust Features with Denoising Autoencoders

• Key idea: corrupt the input, then train the autoencoder to reconstruct the original (uncorrupted) input, so it cannot simply learn the identity mapping.

Excerpt from the paper (Section 2.3, The Denoising Autoencoder): “To test our hypothesis and enforce robustness to partially destroyed inputs we modify the basic autoencoder [...] towards reconstructing the uncorrupted version from the corrupted version. Note that in this way, the autoencoder cannot learn the identity, unlike the basic autoencoder, thus removing the constraint that d [...]”
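The corruption step itself is simple. A minimal sketch, assuming masking noise (randomly zeroing a fraction of the input components, one of the corruption processes considered in the paper):

```python
import numpy as np

def corrupt(X, destroy_fraction=0.3, rng=None):
    """Masking noise: set a random fraction of each input's components to zero."""
    rng = np.random.default_rng(0) if rng is None else rng
    mask = rng.random(X.shape) >= destroy_fraction
    return X * mask

rng = np.random.default_rng(1)
X = rng.random((100, 20))        # clean inputs
X_tilde = corrupt(X, 0.3, rng)   # corrupted inputs fed to the encoder

# Denoising autoencoder training pairs: encode the *corrupted* input but
# measure the reconstruction error against the *clean* input:
#   Err(W, W') = ∑_{i=1:n} L[ x_i, g_{W'}(f_W(x̃_i)) ]
print(X_tilde.shape, (X_tilde == 0).mean())   # roughly destroy_fraction of entries are zero
```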

• Goal: Learn a representation that is robust to noise and perturbations of the input data, by regularizing the latent space (via the L2 norm of the Jacobian of the encoder at the input).

• Contractive autoencoder training criterion:

Err(W, W’) = ∑_{i=1:n} L[ x_i , g_{W’}(f_W(x_i)) ] + λ ||J(x_i)||_F²

where J(x_i) = ∂f_W(x_i)/∂x_i is the Jacobian matrix of the encoder evaluated at x_i, ||·||_F is the Frobenius norm, and λ controls the strength of regularization.
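For a sigmoid encoder h = s_f(Wx), the Jacobian has the closed form J(x) = diag(h ⊙ (1 − h)) W, so the penalty is cheap to compute. A minimal sketch (my own illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random(20)                           # one input x_i
W = rng.normal(scale=0.1, size=(8, 20))      # encoder weights
h = sigmoid(W @ x)                           # encoded input f_W(x)

# Jacobian of the encoder at x: J = diag(h * (1 - h)) @ W, shape (8, 20)
J = (h * (1 - h))[:, None] * W

# Contractive penalty: squared Frobenius norm of the Jacobian
frob_sq = np.sum(J ** 2)
# Equivalent closed form: sum_j (h_j (1 - h_j))^2 * ||W_j||^2
frob_sq_closed = np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))
print(frob_sq, np.isclose(frob_sq, frob_sq_closed))
```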

Many more similar ideas in the literature…

Supervised learning with deep models

Final step: Train the full network with backpropagation, using the error on the predicted output: Err(W) = ∑_{i=1:n} L[ y_i , o(x_i) ]

http://www.dmi.usherb.ca/~larocheh/projects_deep_learning.html

Supervised learning with deep models

Alternatively: Use the last representation layer (or concatenate all layers) as an input to a standard supervised learning predictor (e.g. SVM).


http://www.dmi.usherb.ca/~larocheh/projects_deep_learning.html
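A sketch of this pipeline with scikit-learn (the "encoder" here is a stand-in random feature map and the labels are toy labels, both only for illustration; in practice the features would come from the pretrained stack):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((300, 20))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)    # toy labels, for illustration only

# Stand-in "encoder": in practice W would come from unsupervised pretraining.
W = rng.normal(scale=0.5, size=(10, 20))
H = sigmoid(X @ W.T)                          # last representation layer

H_tr, H_te, y_tr, y_te = train_test_split(H, y, test_size=0.3, random_state=0)
clf = SVC().fit(H_tr, y_tr)                   # standard supervised predictor on top
print("test accuracy:", clf.score(H_te, y_te))
```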


Variety of training protocols

• Purely supervised:
  – Initialize parameters randomly.
  – Train in supervised mode (gradient descent with backprop).
  – Used in most practical systems for speech and language.

• Unsupervised, layerwise + supervised classifier on top:
  – Train each layer unsupervised, one after the other.
  – Train a supervised classifier on top, keeping the other layers fixed.
  – Good when very few labeled examples are available.

• Unsupervised, layerwise + global supervised fine-tuning:
  – Train each layer unsupervised, one after the other.
  – Add a classifier layer, and retrain the whole thing supervised.
  – Good when the label set is poor.

• Unsupervised pretraining often uses regularized autoencoders.

From: http://www.slideshare.net/philipzh/a-tutorial-on-deep-learning-at-icml-2013

Tip #1: Dropout regularization

• Goal: Learn a model that generalizes well and is robust to variability.

• Method: Independently set each hidden unit activity to zero with probability p (usually p=0.5 works best).

• Effect: Can greatly reduce overfitting.
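A minimal sketch of dropout applied to a layer's activations, assuming the common "inverted dropout" convention of rescaling surviving units at training time so that nothing changes at test time:

```python
import numpy as np

def dropout(h, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p, rescale the survivors."""
    if not training:
        return h                                # no dropout (and no rescaling) at test time
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(h.shape) >= p             # keep a unit with probability 1 - p
    return h * mask / (1.0 - p)

h = np.ones((4, 6))                             # some hidden activations
print(dropout(h, p=0.5, rng=np.random.default_rng(0)))
```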

Tip #2: Batch normalization

• Idea: Feature scaling makes gradient descent easier.
• We already apply this at the input layer; extend it to other layers.
• Use empirical batch statistics to choose the re-scaling parameters.

• For each mini-batch of data, at each layer k of the network:
  – Compute the empirical mean μ_B and variance σ²_B independently for each dimension.
  – Normalize each input: x̂ = (x − μ_B) / √(σ²_B + ε)
  – Output has tunable parameters (γ, β) for each layer: y = γ x̂ + β

• Effect: More stable gradient estimates, especially for deep networks.
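A sketch of the normalization for one mini-batch at a single layer (ε is a small constant added for numerical stability; γ and β are the tunable parameters):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch per dimension, then rescale/shift with gamma and beta."""
    mu = x.mean(axis=0)                      # empirical mean of the batch, per dimension
    var = x.var(axis=0)                      # empirical variance, per dimension
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized activations
    return gamma * x_hat + beta              # tunable output for this layer

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 5))   # one mini-batch at some layer
y = batch_norm(x, gamma=np.ones(5), beta=np.zeros(5))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # approximately 0 and 1
```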

Major paradigms for deep learning

• Deep neural networks: the model should be interpreted as a computation graph.
  – Supervised training: feedforward neural networks.
  – Unsupervised pre-training: stacked autoencoders.

• Special architectures for different problem domains.
  – Computer vision => convolutional neural nets.
  – Text and speech => recurrent neural nets (next class).

ImageNet dataset

http://www.image-net.org

Neural networks for computer vision

• Design neural networks that are specifically adapted to:

– Deal with very high-dimensional inputs
  • E.g. 150x150 pixels = 22,500 inputs, or 3x22,500 if RGB

– Exploit 2D topology of pixels (or 3D for video)

– Built-in invariance to certain variations we can expect
  • Translations, illumination, etc.

Convolutional Neural Networks

[Figure: a feedforward network vs. a convolutional neural network (CNN)]

• CNN characteristics:
  – Input is a 3D tensor: 2D image x 3 colours.
  – Each layer transforms an input 3D tensor to an output 3D tensor using a differentiable function.

From: http://cs231n.github.io/convolutional-networks/

Convolutional Neural Networks


• Convolutional neural networks leverage several ideas:
  1. Local connectivity.
  2. Parameter sharing.
  3. Pooling hidden units.

From: http://cs231n.github.io/convolutional-networks/

Convolutional Neural Networks

• A few key ideas:
  1. Features have local receptive fields.
     • Each hidden unit is connected to a patch of the input image.
     • Units are connected to all 3 colour channels.

depth = # filters (a hyperparameter)

Convolutional Neural Networks

• A few key ideas:
  1. Features have local receptive fields.
  2. Share a matrix of parameters across units.
     • Constrain units within a depth slice (at all positions) to have the same weights.
     • The feature map can then be computed via discrete convolution with a kernel matrix (see the sketch below).
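A minimal sketch of computing one feature map by sliding a single shared kernel over a 2D image (stride 1, no padding; like most deep-learning libraries, this is technically cross-correlation rather than a flipped-kernel convolution):

```python
import numpy as np

def feature_map(image, kernel, bias=0.0):
    """Slide one shared kernel over a 2D image (stride 1, no padding)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]          # local receptive field
            out[i, j] = np.sum(patch * kernel) + bias  # same weights at every position
    return out

rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernel = rng.normal(size=(3, 3))                       # one filter = one depth slice
print(feature_map(image, kernel).shape)                # (6, 6)
```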

Convolutional Neural Networks

• A few key ideas:
  1. Features have local receptive fields.
  2. Share a matrix of parameters across units.
  3. Pooling/subsampling of hidden units in the same neighbourhood.

Example:

From: http://cs231n.github.io/convolutional-networks/
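The original example figure is not reproduced here; as a stand-in, a sketch of 2x2 max pooling with stride 2, a common choice for the subsampling step:

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Max-pool a 2D feature map over (size x size) neighbourhoods."""
    H, W = feature_map.shape
    out_h, out_w = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()        # keep only the strongest activation
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(fmap))                       # [[ 5.  7.] [13. 15.]]
```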

Convolutional neural nets (CNNs)

• Alternate between convolutional, pooling, and fully connected layers.
  – Fully connected layers typically appear only at the end.
• The convolutional neural network alternates between convolutional and pooling layers.
• Train the full network using backpropagation.

(image from Yann LeCun)

Convolutional neural nets (CNNs)

From: http://cs231n.github.io/convolutional-networks/

Example: ImageNet

• SuperVision (a.k.a. AlexNet, 2012):

From: http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf


Training results: ImageNet

• 96 learned low-level filters

From: http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf

Image classification

• 95% accuracy (on top-5 predictions) among 1,000 categories. Better than the average human.


Empirical results (2012)

[Figure: ImageNet 1K competition, fall 2012 — classification error by team (ISI, Oxford, LEAR-XRCE, XRCE/INRIA, SuperVision, U. of Amsterdam). SuperVision: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, 2012.]

Empirical results for image retrieval

• Query items in leftmost column:

From: http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf

Empirical results (2015)

http://devblogs.nvidia.com/parallelforall/mocha-jl-deep-learning-julia/

CNNs vs traditional computer vision

From: Razavian et al. CVPR workshop paper. 2014.

Picture tagging (from clarifai.com)

Scene parsing

(Farabet et al., 2013)

Achieving super-human performance?

• Estimated 3% error in the labels.

• Differences between the labeling process and human assessment:
  – Labels acquired as a binary task: is there a dog in this picture?
  – Human performance measured on 1K classes (>120 species of dogs in the dataset).
  – Labels acquired from experts (dog experts label the dogs, etc.).

• Machines and humans make different kinds of mistakes.
  – Both have trouble with multiple objects in an image.
  – Machines struggle with small/thin objects and image filters.
  – Humans struggle with fine-grained recognition.

http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/

Practical tips for CNNs

• Many hyper-parameters to choose!

• Architecture: filters (start small, e.g. 3x3, 5x5), pooling, number of layers (start small, add more).

• Training: learning rate, regularization, dropout rate (=0.5), initial weight size, batch size, batch norm.

• Read papers, copy their method, then do local search.

Do we really need deep architectures?

• We can approximate any function to arbitrary levels of precision with shallow (2-level) architectures.

• Deep learning is more efficient for representing certain classes of functions, where there are certain types of structure.
  – Natural signals (images, speech) typically have such structure.

• Deep learning architectures can represent more complex functions with fewer parameters. – Trade-off (less) space for (more) time.

• So far, very little theoretical analysis of deep learning.

Quick recap + more resources

• A good survey paper: – Bengio, Courville, Vincent. Representation learning: A Review and New Perspectives. IEEE T-PAMI. 2013. http://arxiv.org/pdf/1206.5538v2.pdf

• Notes and images in today’s slides taken from:
  • http://cs231n.github.io/convolutional-networks/
  • http://www.cs.toronto.edu/~hinton/csc2535
  • http://deeplearning.net/tutorial/
  • http://www.slideshare.net/philipzh/a-tutorial-on-deep-learning-at-icml-2013
  • http://www.iro.umontreal.ca/~bengioy/papers/ftml.pdf
  • http://www.cs.toronto.edu/~larocheh/publications/icml-2008-denoising-autoencoders.pdf

What you should know

• Types of deep learning architectures:
  – Stacked autoencoders
  – Convolutional neural networks

• Typical training approaches (unsupervised / supervised).

• Examples of successful applications.
