Autoencoders and Self-Supervision
COMP 451 – Fundamentals of Machine Learning
Lecture 25: Autoencoders and self-supervision
William L. Hamilton * Unless otherwise noted, all material posted for this course is copyright of the instructor and cannot be reused or reposted without the instructor's written permission.
William L. Hamilton, McGill University and Mila

Autoencoders and self-supervision
§ Two approaches to dimensionality reduction using deep learning. § This is a rough categorization and not a strict division!!
§ Autoencoders: § Optimize a “reconstruction loss.” § Encoder maps input to a low-dimensional space and decoder tries to recover the original data from the low-dimensional space.
§ “Self-supervision”: § Try to predict some parts of the input from other parts of the input. § I.e., make up labels from x.
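A minimal sketch of "making up labels from x": hide one feature of each input and use the hidden value as the prediction target. The function name and the choice of masking a single column are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_self_supervised_pairs(X, mask_idx):
    """Turn unlabeled rows of X into (input, label) pairs by hiding one
    feature: the masked column becomes the prediction target."""
    X = np.asarray(X, dtype=float)
    inputs = X.copy()
    labels = X[:, mask_idx].copy()   # labels are "made up" from x itself
    inputs[:, mask_idx] = 0.0        # hide the target feature in the input
    return inputs, labels

X = rng.normal(size=(4, 3))
inp, lab = make_self_supervised_pairs(X, mask_idx=1)
```

Any model trained to predict `lab` from `inp` is doing self-supervised learning: no human labels were needed.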
Autoencoding: the basic idea
Image credit: https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html

Learning an autoencoder function
§ Goal: Learn a compressed representation of the input data.
§ We have two functions (usually neural networks):
§ Encoder: z = g_φ(x)
§ Decoder: x̂ = f_θ(z)
§ Only interesting when z has much smaller dimension than x!
§ Train using a reconstruction loss:
J(x, x̂) = ‖x − x̂‖² = ‖x − f_θ(g_φ(x))‖²

Autoencoding
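A minimal sketch of the encoder/decoder pair and reconstruction loss, using plain linear maps in NumPy. The names and dimensions are illustrative; a real autoencoder would use deeper networks and learn the weights by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 8, 2                                   # input dim m, bottleneck dim k (k << m)

W_enc = rng.normal(scale=0.1, size=(k, m))    # encoder weights (phi)
W_dec = rng.normal(scale=0.1, size=(m, k))    # decoder weights (theta)

def encode(x):
    """z = g_phi(x): map input to the low-dimensional code."""
    return W_enc @ x

def decode(z):
    """x_hat = f_theta(z): try to recover the input from the code."""
    return W_dec @ z

def reconstruction_loss(x):
    """J(x, x_hat) = ||x - f_theta(g_phi(x))||^2."""
    x_hat = decode(encode(x))
    return float(np.sum((x - x_hat) ** 2))

x = rng.normal(size=m)
loss = reconstruction_loss(x)
```

Training would minimize this loss over a dataset with respect to both weight matrices.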
Image credit: https://lilianweng.github.io/lil-log/2018/08/12/from-autoencoder-to-beta-vae.html

Recall: Principal Component Analysis (PCA)
§ Idea: Project data into a lower-dimensional subspace, ℝᵐ → ℝᵐ′, where m′ < m.
§ Consider a linear mapping, xᵢ → Wᵀxᵢ.
§ W is the compression matrix with dimension ℝ^(m×m′).
§ Assume there is a decompression matrix U with dimension ℝ^(m×m′).
§ Solve the following problem: argmin_(W,U) Σᵢ₌₁..ₙ ‖xᵢ − UWᵀxᵢ‖²
§ Equivalently: argmin_(W,U) ‖X − XWUᵀ‖²
§ The solution is given by the eigendecomposition of XᵀX.
§ W is the m×m′ matrix containing the first m′ eigenvectors of XᵀX (sorted in descending order by the magnitude of the eigenvalue).
§ Equivalently: W is the m×m′ matrix containing the first m′ right singular vectors of X.
§ Note: The columns of W are orthogonal!

PCA vs autoencoders

§ In the case of a linear encoder and decoder,
f_W(x) = Wx, g_W′(h) = W′h,
with squared-error reconstruction loss, we can show that the minimum-error solution W yields the same subspace as PCA.

More advanced encoders and decoders

§ What to use as encoders and decoders?
§ Most data (e.g., arbitrary real-valued or categorical features): encoder and decoder are feed-forward neural networks.
§ Sequence data: encoder and decoder are RNNs.
§ Image data: encoder is a CNN; decoder is a deconvolutional network.
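The PCA solution above can be sketched in a few lines of NumPy: compress with the top m′ right singular vectors of the centred data matrix, then decompress. For PCA the optimal decompression matrix equals W, since W has orthonormal columns. Dimensions here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, m_prime = 50, 5, 2

X = rng.normal(size=(n, m))
X = X - X.mean(axis=0)               # centre the data first

# Right singular vectors of X = eigenvectors of X^T X,
# already sorted by descending singular value.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:m_prime].T                   # m x m' compression matrix

Z = X @ W                            # compressed representation (n x m')
X_hat = Z @ W.T                      # decompression: optimal U = W

err = float(np.sum((X - X_hat) ** 2))  # total squared reconstruction error
```

Increasing `m_prime` toward `m` drives `err` to zero, mirroring the autoencoder intuition that the bottleneck must be small for the representation to be interesting.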
Aside: Deconvolutions

§ "Deconvolution" is just a transposed convolution.

Regularization of autoencoders

§ How can we generate sparse autoencoders? (And also, why?)
§ Weight tying of the encoder and decoder weights (W′ = Wᵀ) to explicitly constrain (regularize) the learned function.
§ Directly penalize the output of the hidden units (e.g., with an L1 penalty) to introduce sparsity in the learned representation.
§ Penalize the average output (over a batch of data) to encourage it to approach a fixed target.

Denoising autoencoders
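A sketch of the corruption step a denoising autoencoder trains against. Zeroing each input dimension independently at random ("masking noise") is one common corruption choice; the function name and probability p are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, p=0.3):
    """Masking corruption: each input dimension is independently
    zeroed ("destroyed") with probability p."""
    mask = rng.random(x.shape) >= p
    return x * mask

x = rng.normal(size=(4, 8))
x_tilde = corrupt(x)

# A denoising autoencoder minimizes ||x - f_theta(g_phi(x_tilde))||^2:
# it must reconstruct the CLEAN x from the corrupted x_tilde, so the
# identity map is no longer a trivial solution.
```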
Excerpt from Vincent et al., "Extracting and Composing Robust Features with Denoising Autoencoders" (Section 2.3, The Denoising Autoencoder):

"To test our hypothesis and enforce robustness to partially destroyed inputs we modify the basic autoencoder towards reconstructing the uncorrupted version from the corrupted version. Note that in this way, the autoencoder cannot learn the identity, unlike the basic autoencoder, thus removing the constraint that d…"
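Returning to the "Aside: Deconvolutions" slide above, a minimal 1-D transposed convolution can be sketched as follows (the stride and kernel values are illustrative). Note how it upsamples its input, which is why autoencoder image decoders use it to grow the code back to image size.

```python
import numpy as np

def conv1d_transpose(x, kernel, stride=2):
    """Minimal 1-D transposed ("de")convolution: each input element
    scatters a scaled copy of the kernel into the output, so the
    signal gets longer rather than shorter."""
    k = len(kernel)
    out = np.zeros(stride * (len(x) - 1) + k)
    for i, v in enumerate(x):
        out[i * stride : i * stride + k] += v * kernel
    return out

x = np.array([1.0, 2.0, 3.0])
up = conv1d_transpose(x, np.array([1.0, 1.0]), stride=2)
# up == [1., 1., 2., 2., 3., 3.] : 3 inputs scattered into 6 outputs
```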