
Autoencoders and Self-Supervision


COMP 451 – Fundamentals of Machine Learning. Lecture 25 --- Autoencoders and self-supervision

William L. Hamilton * Unless otherwise noted, all material posted for this course is copyright of the instructor and cannot be reused or reposted without the instructor’s written permission.

Autoencoders and self-supervision

§ Two approaches to dimensionality reduction using deep learning. § This is a rough categorization and not a strict division!!

§ Autoencoders: § Optimize a “reconstruction loss.” § Encoder maps input to a low-dimensional space and decoder tries to recover the original data from the low-dimensional space.

§ “Self-supervision”: § Try to predict some parts of the input from other parts of the input. § I.e., make up labels from x.


Autoencoding: the basic idea

Image credit: https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html

Learning an autoencoder function

§ Goal: Learn a compressed representation of the input data.
§ We have two functions (usually neural networks):
§ Encoder: z = g(x)
§ Decoder: x̂ = f_θ(z)
§ Only interesting when z has much smaller dimension than x!

§ Train using a reconstruction loss:

J(x, x̂) = || x – x̂ ||^2 = || x – f_θ(g(x)) ||^2
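As an illustration (my sketch, not from the slides; the layer and latent sizes are arbitrary), assuming PyTorch:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder g: maps x to a low-dimensional code z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder f_theta: tries to recover x from z
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(64, 784)                        # stand-in for a batch of data
x_hat = model(x)
loss = ((x - x_hat) ** 2).sum(dim=1).mean()     # reconstruction loss J(x, x_hat)
loss.backward()
opt.step()
```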

Autoencoding

Image credit: https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html

Recall: Principal Component Analysis (PCA)

§ Idea: Project data into a lower-dimensional sub-space, R^m --> R^m', where m' < m.

§ Consider a linear mapping, x_i --> W^T x_i
§ W is the compression matrix, with dimension R^{m x m'}.
§ Assume there is a decompression matrix U that maps the compressed representation back to R^m.

Recall: Principal Component Analysis (PCA)

§ Solve the following problem: argmin_{W,U} Σ_{i=1:n} || x_i – U W^T x_i ||^2
§ Equivalently: argmin_{W,U} || X – X W U^T ||_F^2

§ Solution is given by the eigen-decomposition of X^T X.
§ W is the m x m' matrix whose columns are the first m' eigenvectors of X^T X (sorted in descending order by the magnitude of the eigenvalue).
§ Equivalently: W is the m x m' matrix containing the first m' right singular vectors of X.
§ Note: The columns of W are orthogonal!
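For concreteness, here is a small NumPy sketch of this recipe (mine, not from the slides), assuming the rows of X are centered examples:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))        # n x m data matrix, rows are examples
X = X - X.mean(axis=0)                # center the data first

m_prime = 3
# SVD: X = A @ diag(S) @ Vt; the rows of Vt are the right singular vectors of X,
# i.e. the eigenvectors of X^T X sorted by singular value.
A, S, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:m_prime].T                    # m x m' compression matrix

Z = X @ W                             # compressed representation (n x m')
X_hat = Z @ W.T                       # reconstruction; for PCA the optimal U equals W
err = np.sum((X - X_hat) ** 2)
print(err)
```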

PCA vs autoencoders

In the case of linear encoders and decoders,

f_W(x) = Wx (encoder)
g_{W'}(h) = W'h (decoder),

with a squared-error reconstruction loss, we can show that the minimum-error solution W yields the same subspace as PCA.

More advanced encoders and decoders

§ What to use as encoders and decoders?

§ Most data (e.g., arbitrary real-valued or categorical features). § Encoder and decoder are feed-forward neural networks.

§ Sequence data § Encoder and decoder are RNNs.

§ Image data § Encoder is a CNN; decoder is a deconvolutional network.

Aside: Deconvolutions

§ “Deconvolution” is just a transposed convolution.
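For example (a sketch of mine, assuming PyTorch; the channel counts and shapes are arbitrary), a transposed convolution undoes the spatial downsampling of a strided convolution, which is what a convolutional decoder needs:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                      # batch of 3-channel 32x32 inputs

conv = nn.Conv2d(3, 8, kernel_size=4, stride=2, padding=1)
h = conv(x)                                        # downsamples to (1, 8, 16, 16)

deconv = nn.ConvTranspose2d(8, 3, kernel_size=4, stride=2, padding=1)
x_up = deconv(h)                                   # upsamples back to (1, 3, 32, 32)

print(h.shape, x_up.shape)
```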

Regularization of autoencoders

§ How can we regularize autoencoders, e.g., to make them sparse? (And also, why?)

§ Weight tying of the encoder and decoder weights (i.e., setting the decoder weights to the transpose of the encoder weights, W' = W^T) to explicitly constrain (regularize) the learned function.

§ Directly penalize the output of the hidden units (e.g., with an L1 penalty) to introduce sparsity in the learned representation (see the sketch after this list).

§ Penalize the average output of the hidden units (over a batch of data) to encourage it to approach a fixed target (e.g., a small target activation, which also encourages sparsity).
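A minimal sketch of the L1-penalty option (mine, not from the slides; the layer sizes and penalty weight lam are arbitrary), assuming PyTorch:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Linear(64, 784)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

lam = 1e-3                       # strength of the sparsity penalty (hypothetical value)
x = torch.randn(32, 784)         # stand-in batch

h = encoder(x)                   # hidden-unit outputs
x_hat = decoder(h)

recon = ((x - x_hat) ** 2).sum(dim=1).mean()   # reconstruction loss
sparsity = h.abs().sum(dim=1).mean()           # L1 penalty on the hidden activations
loss = recon + lam * sparsity

opt.zero_grad()
loss.backward()
opt.step()
```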

Denoising autoencoders

Extracting and Composing Robust Features with Denoising Autoencoders

From Section 2.3 (“The Denoising Autoencoder”) of Vincent et al. (2008): to test the hypothesis and enforce robustness to partially destroyed inputs, the basic autoencoder is modified towards reconstructing the uncorrupted version of the input from the corrupted version. Note that in this way the autoencoder cannot learn the identity, unlike the basic autoencoder, thus removing the constraint that the hidden code be lower-dimensional than the input.
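In code, the change from the basic autoencoder is small. A rough sketch (mine, assuming PyTorch), where the input is corrupted by randomly zeroing entries but the loss still targets the clean input:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Linear(64, 784)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.randn(32, 784)                      # clean batch (stand-in data)

# Corrupt the input: randomly destroy (zero out) roughly 30% of the entries.
mask = (torch.rand_like(x) > 0.3).float()
x_tilde = x * mask

x_hat = decoder(encoder(x_tilde))             # reconstruct from the corrupted input...
loss = ((x - x_hat) ** 2).sum(dim=1).mean()   # ...but compare against the clean input

opt.zero_grad()
loss.backward()
opt.step()
```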

Contractive autoencoders

§ Goal: Learn a representation that is robust to noise and perturbations of the input data, by regularizing the latent space (via the L2 / Frobenius norm of the Jacobian of the encoder at the input).

§ Contractive autoencoder training criterion:

Err(W, W') = Σ_{i=1:n} ( L[ x_i , f_{W'}(g_W(x_i)) ] + λ ||J(x_i)||_F^2 )

where L is some reconstruction loss, J(x_i) = ∂g_W(x_i)/∂x_i is the Jacobian matrix of the encoder evaluated at x_i, ||·||_F is the Frobenius norm, and λ controls the strength of the regularization.
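As a sketch of how this criterion can be implemented (mine, not from the slides): for a single-layer sigmoid encoder, the squared Frobenius norm of the Jacobian has a simple closed form, assuming PyTorch:

```python
import torch
import torch.nn as nn

W_enc = nn.Linear(784, 64)       # encoder weights W (sigmoid activation below)
W_dec = nn.Linear(64, 784)       # decoder weights W'
opt = torch.optim.Adam(list(W_enc.parameters()) + list(W_dec.parameters()), lr=1e-3)

lam = 1e-4                       # regularization strength (hypothetical value)
x = torch.randn(32, 784)

h = torch.sigmoid(W_enc(x))      # encoder output g_W(x)
x_hat = W_dec(h)                 # reconstruction f_W'(g_W(x))

recon = ((x - x_hat) ** 2).sum(dim=1).mean()

# For a sigmoid encoder, the squared Frobenius norm of the Jacobian dh/dx
# has the closed form sum_j (h_j (1 - h_j))^2 * ||W_j||^2 (W_j = j-th row of W).
w_row_norms = (W_enc.weight ** 2).sum(dim=1)              # ||W_j||^2, shape (64,)
jac_penalty = ((h * (1 - h)) ** 2 * w_row_norms).sum(dim=1).mean()

loss = recon + lam * jac_penalty
opt.zero_grad()
loss.backward()
opt.step()
```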

Many more similar ideas in the literature…

Autoencoding: The key idea

Image credit: https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html

Autoencoding language?

[Diagram: an autoencoder mapping the word “dog” to a code and back to “dog”.]

Autoencoding language

§ Autoencoding can generate high-quality low-dimensional features.

§ Works well when the original space x is rich and high-dimensional. § Images § MRI data § Speech data § Video

§ But what if we don’t have a good feature space to start with? E.g., for representing words?

Autoencoders and self-supervision

§ Two approaches to dimensionality reduction using deep learning. § This is a rough categorization and not a strict division!!

§ Autoencoders: § Optimize a “reconstruction loss.” § Encoder maps input to a low-dimensional space and decoder tries to recover the original data from the low-dimensional space.

§ “Self-supervision”: § Try to predict some parts of the input from other parts of the input. § I.e., make up labels from x.

Word embeddings / self-supervision

§ What is the input space for words/language?

§ In the unsupervised case, generally all we have is a corpus (i.e., a set of documents).

§ How can we learn representations from this data?

Word embeddings / self-supervision

§ Idea: Make up a task in order to learn representations for words!

§ Co-occurrence information tells us a lot about word meaning. § For example, “dog” and “pitbull” occur in many of the same contexts. § For example, “loved” and “appreciated” occur in many of the same contexts.

§ Let’s learn features/representations for words that are good for predicting their context!

Word embeddings / self-supervision

§ Create a training set by applying a “sliding window” over the corpus.

§ I.e., given a word, we try to predict what words are likely to occur around it.
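For instance (a toy sketch of mine, not taken from the linked tutorial), building (input word, context word) training pairs with a window of size 2:

```python
corpus = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, word in enumerate(corpus):
    # every word within `window` positions of the input word is a context word
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pairs.append((word, corpus[j]))

print(pairs[:4])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```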

http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

Word2Vec / SkipGram Model

§ Key idea: Train a neural network to predict context words from input word.

§ Hidden layer learns representations/features for words!


http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

Word2Vec / SkipGram Model

§ Intuition: dot-product between word representations is proportional to the probability that they co-occur in the corpus! § One issue is that the output layer is very big! § We need to do a softmax over the entire vocabulary!
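To make the cost concrete, here is a rough sketch (mine, not the tutorial's code; the vocabulary size and word ids are placeholders) of the skip-gram forward pass with a full softmax output layer, assuming PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim = 10000, 100
emb_in = nn.Embedding(vocab_size, dim)        # hidden layer = word representations
out_layer = nn.Linear(dim, vocab_size)        # output layer over the whole vocabulary

w = torch.tensor([17])                        # input word id (hypothetical)
c = torch.tensor([42])                        # context word id to predict (hypothetical)

z_w = emb_in(w)                               # look up the input word's representation
logits = out_layer(z_w)                       # one score per vocabulary word: O(|V|)
loss = F.cross_entropy(logits, c)             # softmax over the entire vocabulary
loss.backward()
```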

http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

Negative sampling

§ Instead of using a softmax, we approximate it!

§ Original softmax loss:

log P(w, c) = log( exp(z_w^T z_c) / Σ_{c'∈V} exp(z_w^T z_c') )
            = z_w^T z_c - log( Σ_{c'∈V} exp(z_w^T z_c') )

§ Here z_w^T z_c is the dot-product of the word embeddings, and training minimizes the negative log-likelihood of seeing word c in the context of word w. The sum over the entire vocabulary V makes the second term very expensive: O(|V|).

http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

Negative sampling

§ Instead of using a softmax, we approximate it! Instead of summing over the entire vocabulary, we just sample N “negative example” words.

§ Negative sampling loss:

log P(w, c) ≈ log σ(z_w^T z_c) + Σ_{j=1:N} log σ(-z_w^T z_{c'_j})

§ The probability that w and c co-occur is approximated with a sigmoid σ. Key idea: the dot-product for co-occurring pairs should be higher than for random pairs!
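A rough sketch of this approximation (mine, not from the linked tutorial; emb_in, emb_out, and the word ids are placeholder names and values), assuming PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim, N = 10000, 100, 5
emb_in = nn.Embedding(vocab_size, dim)    # z_w: input-word embeddings
emb_out = nn.Embedding(vocab_size, dim)   # z_c: context-word embeddings

w = torch.tensor([17])                    # input word id (hypothetical)
c = torch.tensor([42])                    # observed context word id (hypothetical)
neg = torch.randint(0, vocab_size, (N,))  # N sampled "negative example" words

z_w, z_c, z_neg = emb_in(w), emb_out(c), emb_out(neg)

pos_score = (z_w * z_c).sum(dim=1)        # z_w^T z_c
neg_score = z_neg @ z_w.squeeze(0)        # z_w^T z_c'_j for each negative sample

# Negative sampling objective: maximize log sigma(pos) + sum_j log sigma(-neg_j)
loss = -(F.logsigmoid(pos_score).sum() + F.logsigmoid(-neg_score).sum())
loss.backward()
```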

http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

Variants of word2vec

https://towardsdatascience.com/an-implementation-guide-to-word2vec-using-numpy-and-google-sheets-13445eebd281

Word2Vec and autoencoders

[Diagram comparing word2vec to an autoencoder, with both sides labelled “vector summarizing a word’s co-occurrence statistics”.]

“Self-supervised learning” more generally

§ Key idea: Create supervised data from unsupervised data by predicting some parts of the input from other parts of the input.

§ A relatively recent idea.

§ Has led to new state-of-the-art in language and vision tasks.

§ E.g., BERT and ELMo in NLP.
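As a toy illustration of “making up labels from x” (my sketch, loosely in the spirit of masked-word prediction; not the actual BERT procedure):

```python
import random

random.seed(1)
tokens = "the quick brown fox jumps over the lazy dog".split()

# Self-supervision: hide ~15% of the tokens and ask the model to predict them.
masked, targets = [], []
for i, tok in enumerate(tokens):
    if random.random() < 0.15:
        masked.append("[MASK]")
        targets.append((i, tok))      # (position, original token) is the made-up label
    else:
        masked.append(tok)

print(masked)
print(targets)
```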
