Unsupervised Learning: Autoencoders
Yunsheng Bai
Roadmap
1. Introduction to Autoencoders
2. Sparse Autoencoders (SAE) (2008)
3. Denoising Autoencoders (DAE) (2008)
4. Contractive Autoencoders (CAE) (2011)
5. Stacked Convolutional Autoencoders (SCAE) (2011)
6. Recursive Autoencoders (RAE) (2011)
7. Variational Autoencoders (VAE) (2013)
8. Adversarial Autoencoders (AAE) (2015)
9. Wasserstein Autoencoders (WAE) (2017)
10. Autoencoders for Graphs

Introduction to Autoencoders
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
https://en.wikipedia.org/wiki/Principal_component_analysis#/media/File:GaussianScatterPCA.svg
https://www.cs.toronto.edu/~urtasun/courses/CSC411/14_pca.pdf

PCA can be viewed as a change of basis: projecting a data point onto a principal component is an inner product between the two.
https://www.cs.toronto.edu/~urtasun/courses/CSC411/14_pca.pdf

PCA ≈ Autoencoder with a Linear Activation Function
A linear autoencoder learns the same subspace as PCA, but its weight vectors are not necessarily orthogonal. The network could have many layers, but as long as every activation is linear, the encoder collapses to a single matrix W and the decoder to a single matrix V. (A minimal code sketch follows at the end of this introduction.)
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
https://www.cs.toronto.edu/~urtasun/courses/CSC411/14_pca.pdf

PCA vs. Autoencoder
- Autoencoders are much more flexible than PCA.
- Neural-network activation functions introduce non-linearities into the encoding, whereas PCA only performs a linear transformation.
- We can stack autoencoders to form a deep autoencoder network.
[Figure: a stacked autoencoder with Layer 1 through Layer 4.]
https://towardsdatascience.com/autoencoders-are-essential-in-deep-neural-nets-f0365b2d1d7c
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python

Goal: Learn Useful Features from Data
We have seen that autoencoders can do PCA, but fundamentally, why does an autoencoder work?
https://hackernoon.com/autoencoders-deep-learning-bits-1-11731e200694

Goal: Feature/Representation Learning
Why can't an autoencoder simply copy the input to the output through identity functions? If the code is at least as large as the input (overcomplete), the encoder f and the decoder g can each learn an identity mapping, which minimizes the reconstruction objective min ||x - g(f(x))||^2 trivially without learning anything useful.
[Figure: an overcomplete autoencoder copying the input exactly via identity weight matrices.]
Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python

To Achieve Feature Learning: Conflicting Goals
Autoencoders are designed to be unable to learn to copy perfectly. Usually they are restricted in ways that allow them to copy only approximately. Because the model is forced to prioritize which aspects of the input should be copied, it often learns useful properties of the data.
Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville)
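To make the reconstruction objective concrete, below is a minimal code sketch in the Keras style (an illustration, not code from the slides; the layer sizes and toy data are assumptions). It trains an autoencoder to minimize ||x - g(f(x))||^2; with linear activations and an undercomplete code, the encoder spans the same subspace as PCA, though its weights need not be orthogonal.

import numpy as np
import tensorflow as tf

# Toy data: 1,000 samples of 30-dimensional inputs (illustrative only).
x = np.random.randn(1000, 30).astype("float32")

code_dim = 2  # undercomplete: the code h is much smaller than the input

# Encoder f and decoder g; linear activations make this behave like PCA.
encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(30,)),
    tf.keras.layers.Dense(code_dim, activation=None),
])
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(code_dim,)),
    tf.keras.layers.Dense(30, activation=None),
])
autoencoder = tf.keras.Sequential([encoder, decoder])

# Reconstruction objective: min ||x - g(f(x))||^2, i.e. MSE between input and output.
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(x, x, epochs=20, batch_size=32, verbose=0)

codes = encoder.predict(x)  # low-dimensional representations h = f(x)

Replacing activation=None with a non-linearity such as "relu" gives the non-linear encodings discussed in the PCA vs. Autoencoder comparison above.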
"If you could speak only a few words per month, you would probably try to make them worth listening to."

Undercomplete Autoencoders
Constrain the code h to have a smaller dimension than the input. Even then, if the encoder and decoder are too powerful, the autoencoder can still copy its input without learning useful features. :(
Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python
http://rgraphgallery.blogspot.com/2013/04/rg-3d-scatter-plots-with-vertical-lines.html

Regularized Autoencoders
Regularized autoencoders use a loss function that encourages the model to have other properties besides the ability to copy its input to its output. These other properties include sparsity of the representation, smallness of the derivative of the representation, and robustness to noise or to missing inputs. A regularized autoencoder can be nonlinear and overcomplete but still learn something useful about the data distribution, even if the model capacity is great enough to learn a trivial identity function.
2008: Sparse Autoencoders (SAE)
2008: Denoising Autoencoders (DAE)
2011: Contractive Autoencoders (CAE)
2011: Stacked Convolutional Autoencoders (SCAE)
2011: Recursive Autoencoders (RAE)
2013: Variational Autoencoders (VAE)
2015: Adversarial Autoencoders (AAE)
2017: Wasserstein Autoencoders (WAE)
→ these variants introduce new terms into the loss; they are just different regularizers (see the sparse-autoencoder sketch below, after the history discussion)
Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville)

Properties of Autoencoders (Ideally)
1. Learn useful features from data (effective representations)
   a. Capture the intrinsic properties of data → feed them into downstream applications
   b. Can be thought of as patterns in data → generate new data
2. Produce low-dimensional vectors (efficient/compact representations)
   a. Efficient for storage
   b. Efficient for downstream models
   c. May be free of the noise present in the input
   d. Easier to visualize than high-dimensional data
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Properties of Autoencoders (Ideally)
3. Are flexible: can be modified/guided/regularized in various ways:
   a. Input data, e.g. add noise
   b. Output data, e.g. something different from the input
   c. Architecture, e.g. fully connected layer → convolutional layer
   d. Loss, e.g. add additional loss terms → capture other useful information from the input
   e. Latent space, e.g. Gaussian (more later in VAE)
      i. Enforce certain prior knowledge, usually through additional loss terms
      ii. Analyzing the latent space/representations is a trend (?), e.g. debiasing word embeddings
   f. ... (Be creative! This is where research comes from.)

History of Autoencoders
10 years ago, we thought that deep nets would also need an unsupervised cost, like the autoencoder cost, to regularize them. Today, we know we are able to recognize images just by using backprop on the supervised cost as long as there is enough labeled data. (Humans can learn from very few labeled examples. Why? One popular hypothesis: the brain can leverage unsupervised or semi-supervised learning.) There are other tasks where we do still use autoencoders, but they're not the fundamental solution to training deep nets that people once thought they were going to be.
(Ian Goodfellow, 2016)
https://www.quora.com/Why-are-autoencoders-considered-a-failure-What-are-their-alternatives
https://www.doc.ic.ac.uk/~js4416/163/website/nlp/#XGlorot2011
Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville)
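As one concrete example of "introducing new terms into the loss", below is a minimal sketch of a sparse autoencoder in the Keras style (an illustration under assumed layer sizes and toy data, not code from the slides). An L1 activity penalty on the code h is added to the reconstruction loss, so even an overcomplete code cannot simply copy the input.

import numpy as np
import tensorflow as tf

x = np.random.randn(1000, 30).astype("float32")  # toy data, illustrative only

# Overcomplete code (64 > 30), but an L1 penalty on the activations of h is
# added to the loss, so copying via the identity is no longer optimal.
sparse_autoencoder = tf.keras.Sequential([
    tf.keras.Input(shape=(30,)),
    tf.keras.layers.Dense(
        64, activation="relu",
        activity_regularizer=tf.keras.regularizers.l1(1e-4)),  # code h
    tf.keras.layers.Dense(30, activation=None),                # reconstruction
])

# Total loss = ||x - g(f(x))||^2 + lambda * ||f(x)||_1
sparse_autoencoder.compile(optimizer="adam", loss="mse")
sparse_autoencoder.fit(x, x, epochs=20, batch_size=32, verbose=0)

Denoising, contractive, variational, adversarial, and Wasserstein autoencoders follow the same pattern, differing in which extra term (or input corruption) regularizes the reconstruction objective.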
Applications of Autoencoders
1. Data Compression for Storage
   a. Difficult to train an autoencoder that beats a basic algorithm like JPEG
   b. Autoencoders are data-specific: they may generalize poorly to unseen data
2. Dimensionality Reduction for Data Visualization
   a. t-SNE is good, but typically requires relatively low-dimensional data
      i. For high-dimensional data, first use an autoencoder, then use t-SNE
   b. Latent space visualization (more later)
https://blog.keras.io/building-autoencoders-in-keras.html
https://www.doc.ic.ac.uk/~js4416/163/website/nlp/#XVincent2008
https://hackernoon.com/latent-space-visualization-deep-learning-bits-2-bd09a46920df

Applications of Autoencoders
3. Unsupervised Pretraining
   a. Greedy layer-wise unsupervised pretraining: train each layer of a feedforward net greedily; continue stacking layers, with the output of prior layers as the input to the next one; then fine-tune
   b. Today, we have random weight initialization, rectified linear units (ReLUs) (2011), dropout (2012), batch normalization (2014), residual learning (2015) + large labeled datasets
   c. Still useful:
      i. Train a deep autoencoder
      ii. Train an autoencoder on an unlabeled dataset, and reuse the lower layers to create a new network trained on the labeled data (~supervised pretraining); see the code sketch at the end of these notes
      iii. Train an autoencoder on an unlabeled dataset, and use the learned representations in downstream tasks (see more in 4)
https://blog.keras.io/building-autoencoders-in-keras.html
https://www.doc.ic.ac.uk/~js4416/163/website/nlp/#XVincent2008

Greedy Layer-Wise Unsupervised Pretraining for Training Deep Autoencoders
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Unsupervised Pretraining for Supervised Tasks
Downside: a two-stage pipeline → two sets of hyperparameters to tune :(
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Supervised Pretraining
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Multi-Task Learning, Transfer Learning, Domain Adaptation, supervised pretraining
https://www.youtube.com/watch?v=R3DNKE3zKFk

Multi-Task Learning
https://www.youtube.com/watch?v=R3DNKE3zKFk

Applications of Autoencoders
4. Generate Representations for Downstream Tasks
   a. Special case of unsupervised pretraining (3.c.iii)
   b. Useful when the initial representation is poor and there is a lot of unlabeled data
      i. Word embeddings (better than one-hot representations)
      ii. Graph node embeddings
      iii. Image embeddings (images already lie in a rich vector space? Check out puppy image embeddings!)
      iv. Semantic hashing: turn database entries (text, image, etc.) into low-dimensional, binary codes
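To illustrate 3.c.ii and application 4 above, below is a minimal sketch of unsupervised pretraining in the Keras style (an illustration only: the layer sizes, class count, and random data are assumptions, not from the slides). An autoencoder is first trained on unlabeled data; its encoder is then reused as the lower layers of a classifier that is fine-tuned on a small labeled set.

import numpy as np
import tensorflow as tf

# Illustrative data: many unlabeled samples, few labeled ones.
x_unlabeled = np.random.randn(5000, 30).astype("float32")
x_labeled = np.random.randn(200, 30).astype("float32")
y_labeled = np.random.randint(0, 10, size=(200,))

# 1) Unsupervised stage: train an autoencoder on the unlabeled data.
encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(30,)),
    tf.keras.layers.Dense(16, activation="relu"),  # learned representation h
])
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(16,)),
    tf.keras.layers.Dense(30, activation=None),
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(x_unlabeled, x_unlabeled, epochs=20, batch_size=64, verbose=0)

# 2) Supervised stage: reuse the encoder's layers, add a classifier head,
#    and fine-tune on the small labeled dataset.
classifier = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Dense(10, activation="softmax"),
])
classifier.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
classifier.fit(x_labeled, y_labeled, epochs=10, batch_size=32, verbose=0)

The same encoder outputs (encoder.predict(...)) can also be used directly as fixed representations for downstream tasks, as in application 4.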