Unsupervised Learning: Autoencoders
Yunsheng Bai

Roadmap
1. Introduction to Autoencoders
2. Sparse Autoencoders (SAE) (2008)
3. Denoising Autoencoders (DAE) (2008)
4. Contractive Autoencoders (CAE) (2011)
5. Stacked Convolutional Autoencoders (SCAE) (2011)
6. Recursive Autoencoders (RAE) (2011)
7. Variational Autoencoders (VAE) (2013)
8. Adversarial Autoencoders (AAE) (2015)
9. Wasserstein Autoencoders (WAE) (2017)
10. Autoencoders for Graphs

Introduction to Autoencoders
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
Sources: https://en.wikipedia.org/wiki/Principal_component_analysis#/media/File:GaussianScatterPCA.svg, https://www.cs.toronto.edu/~urtasun/courses/CSC411/14_pca.pdf

Change of basis
Inner product between them
Source: https://www.cs.toronto.edu/~urtasun/courses/CSC411/14_pca.pdf

PCA ≈ Autoencoder with Linear Activation Function
Not necessarily orthogonal
Sources: Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems; https://www.cs.toronto.edu/~urtasun/courses/CSC411/14_pca.pdf

PCA ≈ Autoencoder with Linear Activation Function
The autoencoder could have many layers, but as long as every activation is linear, the encoder collapses to a single matrix W and the decoder to a single matrix V.
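The collapse of stacked linear layers can be checked numerically. This is a minimal sketch, assuming three bias-free linear encoder layers with made-up sizes (8 → 6 → 4 → 2); the names `W1`, `W2`, `W3` are illustrative and not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))          # 5 samples, 8 features

# Three stacked linear encoder layers: 8 -> 6 -> 4 -> 2, no activation.
W1 = rng.normal(size=(8, 6))
W2 = rng.normal(size=(6, 4))
W3 = rng.normal(size=(4, 2))

deep_code = ((x @ W1) @ W2) @ W3     # pass through the layers one by one
W = W1 @ W2 @ W3                     # the single equivalent matrix, 8 -> 2
single_code = x @ W

layers_collapse = np.allclose(deep_code, single_code)
print(layers_collapse)
```

The same argument applies to the decoder, which is why a deep linear autoencoder has no more expressive power than one W and one V.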
Sources: Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems; https://www.cs.toronto.edu/~urtasun/courses/CSC411/14_pca.pdf

PCA vs Autoencoder
- Autoencoders are much more flexible than PCA.
- NN activation functions introduce “non-linearities” in encoding, but PCA only does linear transformation.
- We can stack autoencoders to form a deep autoencoder network.
Sources: https://towardsdatascience.com/autoencoders-are-essential-in-deep-neural-nets-f0365b2d1d7c; Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

[Figure: stacked autoencoder, Layer 1 to Layer 4]
Source: Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python

Goal: Learn Useful Features from Data
We’ve seen that autoencoders can do PCA, but fundamentally, why does an autoencoder work?
Source: https://hackernoon.com/autoencoders-deep-learning-bits-1-11731e200694

Goal: Feature/Representation Learning
Why can’t an autoencoder simply copy input to output through identity functions?
Objective: $\min \lVert x - g(f(x)) \rVert^2$, with encoder f and decoder g.
[Figure: an overcomplete autoencoder whose encoder and decoder weight matrices are identity matrices padded with zeros]
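The identity-copy problem can be reproduced in a few lines. This is a sketch under stated assumptions: a linear, bias-free autoencoder with 4 inputs and 6 hidden units, and weights set by hand (not learned) to a zero-padded identity; `W_enc` and `W_dec` are illustrative names.

```python
import numpy as np

n_in, n_hidden = 4, 6                # hidden layer is wider than the input

# Encoder weights: identity on top, zero rows underneath (6 x 4).
W_enc = np.vstack([np.eye(n_in), np.zeros((n_hidden - n_in, n_in))])
# Decoder weights: the transpose (4 x 6).
W_dec = W_enc.T

def f(x):
    """Encoder: copies x into the first n_in hidden units."""
    return W_enc @ x

def g(h):
    """Decoder: reads the first n_in hidden units back out."""
    return W_dec @ h

x = np.array([0.3, -1.2, 2.0, 0.7])
reconstruction_error = np.sum((x - g(f(x))) ** 2)
print(reconstruction_error)
```

The reconstruction loss is minimized, yet the hidden code is only a padded copy of the input, so nothing useful about the data has been learned.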
Source: Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python

To Achieve Feature Learning, Conflicting Goals
Autoencoders are designed to be unable to learn to copy perfectly. Usually they are restricted in ways that allow them to copy only approximately. Because the model is forced to prioritize which aspects of the input should be copied, it often learns useful properties of the data.
Source: Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville)
“If you could speak only a few words per month, you would probably try to make them worth listening to.”

Undercomplete Autoencoders
[Figure: undercomplete autoencoder with hidden code h]
Encoders and decoders are too powerful :(
Sources: Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python; http://rgraphgallery.blogspot.com/2013/04/rg-3d-scatter-plots-with-vertical-lines.html

Regularized Autoencoders
Regularized autoencoders use a loss function that encourages the model to have other properties besides the ability to copy its input to its output. These other properties include sparsity of the representation, smallness of the derivative of the representation, and robustness to noise or to missing inputs. A regularized autoencoder can be nonlinear and overcomplete but still learn something useful about the data distribution, even if the model capacity is great enough to learn a trivial identity function.

2008: Sparse Autoencoders (SAE)
2008: Denoising Autoencoders (DAE)
2011: Contractive Autoencoders (CAE)
2011: Stacked Convolutional Autoencoders (SCAE)
2011: Recursive Autoencoders (RAE)
2013: Variational Autoencoders (VAE)
2015: Adversarial Autoencoders (AAE)
2017: Wasserstein Autoencoders (WAE)
→ introduce new things to the loss
→ they are just different regularizers
Source: Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville)

Properties of Autoencoders (Ideally)
1. Learn useful features from data (effective representations)
   a. Capture the intrinsic properties of data → feed them into downstream applications
   b. Can be thought of as patterns in data → generate new data
2. Produce low-dimensional vectors (efficient/compact representations)
   a. Efficient for storage
   b. Efficient for downstream models
   c. May be free of noise in input
   d. Easier to visualize than high-dimensional data
3. Are flexible: can be modified/guided/regularized in various ways:
   a. Input data, e.g. add noise
   b. Output data, e.g. something different from the input
   c. Architecture, e.g. fully connected layer → convolutional layer
   d. Loss, e.g. add additional loss terms → capture other useful information from input
   e. Latent space, e.g. Gaussian (more later in VAE)
      i. Enforce certain prior knowledge, usually through additional loss terms
      ii. Analyzing the latent space/representations is a trend (?), e.g. debiasing word embeddings
   f. … (Be creative! This is where research comes from)
Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

History of Autoencoders
10 years ago, we thought that deep nets would also need an unsupervised cost, like the autoencoder cost, to regularize them.
Today, we know we are able to recognize images just by using backprop on the supervised cost as long as there is enough labeled data.
(Humans can learn from very few labeled examples. Why? One popular hypothesis: Brain can leverage unsupervised or semi-supervised learning.)
There are other tasks where we do still use autoencoders, but they’re not the fundamental solution to training deep nets that people once thought they were going to be.
(Ian Goodfellow, 2016)
Sources: https://www.quora.com/Why-are-autoencoders-considered-a-failure-What-are-their-alternatives; https://www.doc.ic.ac.uk/~js4416/163/website/nlp/#XGlorot2011; Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville)

Applications of Autoencoders
1. Data Compression for Storage
   a. Difficult to train an autoencoder better than a basic algorithm like JPEG
   b. Autoencoders are data-specific: may be hard to generalize to unseen data
2. Dimensionality Reduction for Data Visualization
   a. t-SNE is good, but typically requires relatively low-dimensional data
      i. For high-dimensional data, first use an autoencoder, then use t-SNE
   b. Latent space visualization (more later)
Sources: https://blog.keras.io/building-autoencoders-in-keras.html; https://www.doc.ic.ac.uk/~js4416/163/website/nlp/#XVincent2008; https://hackernoon.com/latent-space-visualization-deep-learning-bits-2-bd09a46920df

Applications of Autoencoders
3. Unsupervised Pretraining
   a. Greedy Layer-Wise Unsupervised Pretraining: train each layer of a feedforward net greedily; continue stacking layers; the output of prior layers is the input for the next one; fine-tune
   b. Today, we have random weight initialization, rectified linear units (ReLUs) (2011), dropout (2012), batch normalization (2015), residual learning (2015) + large labeled datasets
   c. Still useful
      i. Train a deep autoencoder
      ii. Train an autoencoder on an unlabeled dataset, and reuse the lower layers to create a new network trained on the labeled data (~supervised pretraining)
      iii. Train an autoencoder on an unlabeled dataset, and use the learned representations in downstream tasks (see more in 4)
Sources: https://blog.keras.io/building-autoencoders-in-keras.html; https://www.doc.ic.ac.uk/~js4416/163/website/nlp/#XVincent2008

Greedy Layer-Wise Unsupervised Pretraining for Training Deep Autoencoders
[Figure]

Unsupervised Pretraining for Supervised Tasks
[Figure]
Downside: two-staged → hyperparameter tuning :(

Supervised Pretraining
[Figure]
Source (for the figures above): Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Multi-Task Learning
[Figure labels: transfer learning, domain adaptation, supervised pretraining]
Source: https://www.youtube.com/watch?v=R3DNKE3zKFk

Applications of Autoencoders
4. Generate Representations for Downstream Tasks
   a. Special case of unsupervised pre-training (3.c.iii)
   b. Useful when the initial representation is poor, and there is a lot of unlabeled data
      i. Word embeddings (better than one-hot representations)
      ii. Graph node embeddings
      iii. Image embeddings (Images already lie in a rich vector space? Check out puppy image embeddings!)
      iv. Semantic hashing: turn database entries (text, image, etc.) into low-dimensional and binary codes → Information retrieval
   c. Question: If there are labels, is there any reason to use a decoder with a reconstruction loss?
5. Generate New Data (Generative Model)
   a. Especially, Variational Autoencoders (VAE), Adversarial Autoencoders (AAE) (more later)
   b. Creative applications (more later)
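The retrieval step of semantic hashing (item 4.b.iv) can be sketched as follows. The encoder is not shown: the hidden activations are made-up numbers standing in for its outputs, and thresholding at 0.5 plus Hamming-distance ranking are assumptions chosen for illustration, not details taken from the slides.

```python
import numpy as np

def to_binary_code(hidden):
    """Threshold real-valued hidden activations in [0, 1] at 0.5."""
    return (np.asarray(hidden) > 0.5).astype(np.uint8)

def hamming_distances(query_code, database_codes):
    """Number of differing bits between the query and every database code."""
    return np.count_nonzero(database_codes != query_code, axis=1)

# Hypothetical encoder outputs for four database entries (8-bit codes).
database_hidden = np.array([
    [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.1],
    [0.1, 0.9, 0.2, 0.8, 0.1, 0.9, 0.2, 0.7],
    [0.8, 0.2, 0.9, 0.1, 0.2, 0.1, 0.9, 0.3],
    [0.2, 0.8, 0.1, 0.9, 0.3, 0.8, 0.1, 0.9],
])
query_hidden = np.array([0.7, 0.2, 0.9, 0.4, 0.6, 0.1, 0.8, 0.2])

database_codes = to_binary_code(database_hidden)
query_code = to_binary_code(query_hidden)
distances = hamming_distances(query_code, database_codes)
best_match = int(np.argmin(distances))   # index of the closest database entry
```

Because the codes are short and binary, comparing a query against many entries reduces to cheap bit comparisons.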
Source: Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville)

Graph node embedding
[Figure: copy the output of Layer 2 (the embedding) and train a downstream model (logistic regression, SVM, or another classifier) using Hidden 2 as input]
Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Semantic hashing
[Figure: copy the output of Layer 2 (the code) as the hidden representation; encode the query and the database entries, then compare the codes to look up the query in the database]
Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Applications of Autoencoders
6. Self-supervised Learning
   a. A form of supervised learning where the targets are generated from the input data
   b. Merely learning to reconstruct the input might not be enough to learn abstract features of the kind that label-supervised learning induces (where targets are "dog", "car"...)
      i. Data denoising
      ii. Jigsaw puzzle solver
      iii. ...
Sources: https://blog.keras.io/building-autoencoders-in-keras.html; https://www.doc.ic.ac.uk/~js4416/163/website/nlp/#XVincent2008

Skipgram vs Autoencoders
1. In NLP word embeddings, why is Skipgram more popular than autoencoders?
   a. Simpler
   b. More efficient
   c. Works well already
2. When does Skipgram no longer suffice? Additional goals, e.g.
   a. Denoising
   b. Complex characteristics of word use + polysemy → Use bidirectional LSTM with attention as the encoder!
   c. Generative setting (generate new data)
   d. Inductive setting (embed unseen words)
3. Can Skipgram be viewed as a special case of some autoencoder model?
   a. In fact, encoding and decoding are very general concepts and are used in many places
Sources: Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013). Peters, Matthew E., et al. "Deep contextualized word representations." arXiv preprint arXiv:1802.05365 (2018).

Sparse Autoencoders (SAE) (2008)

Motivation 1: Sparse Coding
- An image should be represented by only a few bases.
- A document should be about only a few topics.
- “If you could speak only a few words per month, you would probably try to make them worth listening to.”
Source: https://www.cs.ubc.ca/~schmidtm/MLRG/sparseCoding.pdf

Change of basis + Sparsity constraint
$D^\top x = h$
$D h = D D^\top x = x \;\Rightarrow\; D D^\top = I$
Source: https://www.cs.ubc.ca/~schmidtm/MLRG/sparseCoding.pdf

Recall PCA
Change of basis
Inner product between them
Source: https://www.cs.toronto.edu/~urtasun/courses/CSC411/14_pca.pdf

Motivation 2: Prevent Identity Transform
[Figure: encoder f and decoder g whose weight matrix W ≈ I is an identity matrix padded with zeros]
Encoder: $W^\top x = h$, i.e. $f(x) = h$
Decoder: $W h = W W^\top x = x$, i.e. $g(h) = x$
$\Rightarrow W W^\top = I$ (fine)
Source: Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python

Motivation 2: Prevent Identity Transform
In the case of images, we can think of W as a set of convolution filters (each with the same size as the input, e.g. 4x4).
[Figure: W is an identity matrix, so every filter is one-hot; the weighted sum of filters (0.3, 0.2, 0.1, + ...) is the same as the input, i.e. x]
Source: https://www.cs.ubc.ca/~schmidtm/MLRG/sparseCoding.pdf

Sparse Autoencoders
Annotation: activation of hidden unit j of layer 2 (assume two layers in encoder)
[Equation: loss for encoder f and decoder g = reconstruction loss + regularization term + sparsity penalty, annotated with the number of training samples]
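The three labeled terms of the loss can be written out as a sketch. The exact formula is not recoverable from the slide, so this assumes the common choice of a KL-divergence sparsity penalty on the average activation of each hidden unit, with squared-error reconstruction and L2 weight decay; `rho` (target activation), `lam`, and `beta` are hypothetical hyperparameter names and values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_autoencoder_loss(X, W1, b1, W2, b2, rho=0.05, lam=1e-4, beta=3.0):
    """X: (m, n_in). Returns the total loss and the mean activations rho_hat."""
    m = X.shape[0]                        # number of training samples
    H = sigmoid(X @ W1 + b1)              # hidden activations, (m, n_hidden)
    X_hat = sigmoid(H @ W2 + b2)          # reconstruction, (m, n_in)

    reconstruction = np.sum((X_hat - X) ** 2) / (2 * m)
    weight_decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))

    rho_hat = H.mean(axis=0)              # average activation of each hidden unit
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    sparsity = beta * np.sum(kl)
    return reconstruction + weight_decay + sparsity, rho_hat

rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 6))
W1, b1 = rng.normal(scale=0.1, size=(6, 4)), np.zeros(4)
W2, b2 = rng.normal(scale=0.1, size=(4, 6)), np.zeros(6)
loss, rho_hat = sparse_autoencoder_loss(X, W1, b1, W2, b2)
```

Note that `rho_hat` is averaged over the training samples, so the penalty constrains how often each hidden unit fires rather than how many units fire for a single input.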
This results in sparse activation of hidden units across training points, but does not guarantee that each input has a sparse representation. (Makhzani, Alireza, and Brendan Frey. "K-sparse autoencoders." arXiv preprint arXiv:1312.5663 (2013).)
Sources: Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python; Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems; http://web.stanford.edu/class/cs294a/sparseAutoencoder.pdf

Results
[Figure]
Source: Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python

Techniques to Interpret Autoencoders
1. Visualize the weight matrix W
   a. Each column of W corresponds to the weights of a particular neuron
   b. When there is a natural interpretation of the weights, can visualize them
      i. Especially true in the case of image as seen previously (~convolution filters)
      ii. Especially true for the top hidden layers since they often capture relatively large features
2. Visualize the most exciting input per neuron
   a. Treat each neuron as a feature detector. To find the feature a particular neuron is looking for,
      i. Feed a random input
      ii. Measure the activation of the neuron you are interested in
      iii. Perform backpropagation to tweak the input so that the neuron will activate even more (gradient ascent)
      iv. Iterate several times
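Steps 2.a.i to 2.a.iv can be sketched for the simplest possible case: a single sigmoid neuron, where the gradient with respect to the input is available in closed form. The weights `w`, the learning rate, and the projection onto the unit ball (added so the input cannot grow without bound) are assumptions for illustration; in a real network the gradient would come from backpropagation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def most_exciting_input(w, b, steps=100, lr=0.1, seed=0):
    """Gradient ascent on the input of one sigmoid neuron a(x) = sigmoid(w.x + b)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(scale=0.1, size=w.shape)   # i. start from a random input
    start = sigmoid(w @ x + b)
    for _ in range(steps):                    # iv. iterate several times
        a = sigmoid(w @ x + b)                # ii. measure the activation
        grad = a * (1 - a) * w                # iii. d a / d x via the chain rule
        x = x + lr * grad                     # move the input uphill
        x = x / max(np.linalg.norm(x), 1.0)   # keep the input inside the unit ball
    return x, start, sigmoid(w @ x + b)

w = np.array([1.0, -2.0, 0.5, 0.0])           # hypothetical weights of one neuron
x_star, a_start, a_end = most_exciting_input(w, b=0.0)
```

For this toy neuron the resulting input drifts toward the direction of `w`, which matches the intuition that the weights themselves describe the feature the neuron detects.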
Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Denoising Autoencoders (DAE) (2008)

Sparse Coding Could Also Handle Image Denoising
Key: the use of sparse and redundant representations over trained dictionaries.
Sources: https://www.cs.ubc.ca/~schmidtm/MLRG/sparseCoding.pdf; Elad, Michael, and Michal Aharon. "Image denoising via learned dictionaries and sparse representation." Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. Vol. 1. IEEE, 2006.

Denoising Autoencoders: Implementation-level
[Figure]
Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Denoising Autoencoders: Results
Gaussian noise
Source: Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python

Denoising Autoencoders: Results
Salt and pepper noise
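The two corruption types named in these results can be sketched as input transformations. This is an illustration only: the noise level `sigma`, the corrupted `fraction`, and the pixel range [0, 1] are assumed values, and the autoencoder itself is omitted. A denoising autoencoder is fed the corrupted input and trained to reconstruct the clean one.

```python
import numpy as np

def add_gaussian_noise(x, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return x + rng.normal(scale=sigma, size=x.shape)

def add_salt_and_pepper(x, fraction, rng, low=0.0, high=1.0):
    """Set a random `fraction` of entries to `low` (pepper) or `high` (salt)."""
    noisy = x.copy()
    corrupt = rng.random(x.shape) < fraction
    salt = rng.random(x.shape) < 0.5
    noisy[corrupt & salt] = high
    noisy[corrupt & ~salt] = low
    return noisy

rng = np.random.default_rng(0)
clean = rng.uniform(0.2, 0.8, size=(4, 16))     # stand-in for image pixels
noisy_gauss = add_gaussian_noise(clean, 0.1, rng)
noisy_sp = add_salt_and_pepper(clean, 0.2, rng)

# Training pair for a denoising autoencoder: input is noisy, target is clean.
inputs, targets = noisy_sp, clean
changed = noisy_sp != clean
```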
Source: Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python

Denoising Autoencoders: Research-level
Why equivalent to reconstruction loss?
(1) Intuitively
(2) Recall that the least-squares estimate is the same as the maximum likelihood estimate under a Gaussian model
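Point (2) can be spelled out in one line. This is a sketch assuming the decoder defines a Gaussian likelihood with mean $g(h)$ and fixed isotropic variance $\sigma^2$ over a $d$-dimensional input; the symbols $\sigma$ and $d$ are introduced here and do not appear on the slide.

```latex
p(x \mid h) = \mathcal{N}\!\left(x;\, g(h),\, \sigma^2 I\right)
\quad\Longrightarrow\quad
-\log p(x \mid h)
  = \frac{1}{2\sigma^2}\,\lVert x - g(h) \rVert^2
  + \frac{d}{2}\log\!\left(2\pi\sigma^2\right)
```

The second term does not depend on the model parameters, so maximizing the likelihood is the same as minimizing the squared reconstruction error.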
Source: Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville)

Contractive Autoencoders (CAE) (2011)

CAE: Resist Infinitesimal Perturbations of Input
All autoencoder training procedures involve a compromise between two opposing forces: being data-specific and being data-insensitive.
CAE and DAE are equivalent under certain conditions.
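Resisting infinitesimal perturbations is usually expressed as a penalty on the derivative of the code with respect to the input. The sketch below assumes a one-layer sigmoid encoder h = sigmoid(Wx + b) and the squared Frobenius norm of its Jacobian as the penalty; neither detail is stated on the slide, and the finite-difference check is included only to confirm the closed form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of dh/dx for h = sigmoid(W @ x + b)."""
    h = sigmoid(W @ x + b)
    # Jacobian is diag(h * (1 - h)) @ W, so its squared norm factorizes.
    return np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))

def numerical_jacobian(x, W, b, eps=1e-6):
    """Central finite differences, used only to check the closed form."""
    J = np.zeros((W.shape[0], x.size))
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        J[:, i] = (sigmoid(W @ (x + step) + b) - sigmoid(W @ (x - step) + b)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(3, 5)), rng.normal(size=3), rng.normal(size=5)
closed_form = contractive_penalty(x, W, b)
finite_diff = np.sum(numerical_jacobian(x, W, b) ** 2)
```

Adding this term to the reconstruction loss pulls the encoder toward being insensitive to the input, while the reconstruction term pulls it toward being data-specific.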
Source: Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville)

Stacked Convolutional Autoencoders (SCAE) (2011)

SCAE
Use convolutional + pooling layers instead of fully connected layers.
Source: Dong, Chao, et al. "Image super-resolution using deep convolutional networks." IEEE transactions on pattern analysis and machine intelligence 38.2 (2016): 295-307.

Image Deblurring/Denoising/Super-Resolution and Image Colorization
[Figure]
Sources: Dong, Chao, et al. (2016), cited above; https://hackernoon.com/autoencoders-deep-learning-bits-1-11731e200694

Recursive Autoencoders (RAE) (2011)

Sentence Representation
Why not simple average? “white blood cells destroying an infection” ≠ “an infection destroying white blood cells”
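The two example sentences make the point directly: they contain the same words, so any order-insensitive pooling gives them the same vector. A small check, using random 4-dimensional vectors as hypothetical word embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
sentence_a = "white blood cells destroying an infection".split()
sentence_b = "an infection destroying white blood cells".split()

# Hypothetical 4-dimensional word vectors, one per distinct word.
vocab = sorted(set(sentence_a))
embedding = {word: rng.normal(size=4) for word in vocab}

def average_embedding(words):
    """Bag-of-words sentence vector: the mean of the word vectors."""
    return np.mean([embedding[w] for w in words], axis=0)

same_vector = np.allclose(average_embedding(sentence_a), average_embedding(sentence_b))
print(same_vector)
```

A recursive autoencoder avoids this by combining child vectors pairwise along a tree, so the order of composition affects the result.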
Source: Socher, Richard, et al. "Semi-supervised recursive autoencoders for predicting sentiment distributions." Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, 2011.

Sentence Representation
- Could penalize top-level nodes more heavily, since they contain more children
- Could predict all children underneath → unfolding RAE
- Could introduce a supervised loss
- Could normalize the hidden representations
- Could use many layers
- Could use parse tree
Sources: https://www.doc.ic.ac.uk/~js4416/163/website/nlp/recursive.html; Socher, Richard, et al. "Dynamic pooling and unfolding recursive autoencoders for paraphrase detection." Advances in neural information processing systems. 2011.

Variational Autoencoders (VAE) (2013)
Encoder Outputs Statistical Distributions; Feed Samples into Decoder → Add Noise at All Times; Generate New Data After Training

VAE: Intuition
Source: https://www.jeremyjordan.me/variational-autoencoders/

VAE: Implementation-level
Probabilistic (outputs are partly determined by chance, even after training) + generative autoencoders.
Assume the prior distribution of z, i.e. p(z), to be Gaussian → encourage the learned posterior q(z|x) to be similar to p(z) through an additional loss term measuring their KL divergence.
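The additional KL loss term has a closed form when both distributions are Gaussian. The sketch below assumes the encoder outputs a mean `mu` and a log-variance `log_var` of a diagonal Gaussian q(z|x) and that the prior p(z) is a standard normal; sampling uses the reparameterization z = mu + sigma * eps. The parameterization and variable names are assumptions, not taken from the slide.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterization: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per row."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)

rng = np.random.default_rng(0)
mu = np.array([[0.0, 0.0], [1.0, -1.0]])        # hypothetical encoder outputs
log_var = np.array([[0.0, 0.0], [0.0, 0.0]])    # unit variance in both rows
z = sample_latent(mu, log_var, rng)             # what the decoder would receive
kl = kl_to_standard_normal(mu, log_var)
```

The first row already matches the prior, so its KL term is zero; the second row is shifted away from the origin and is penalized. The total VAE loss adds this term to the reconstruction loss.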
Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

VAE: Research-level
Variational Bayesian Inference
- Generative model: probabilistic decoder
- Recognition model: probabilistic encoder, a variational approximation to the intractable true posterior
- z: latent representation or code
Sources: Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." arXiv preprint arXiv:1312.6114 (2013); https://www.jeremyjordan.me/variational-autoencoders/

MusicVAE: Generative Model → Creative Artists
The desirable properties of a latent space can be summarized as follows:
1. Expression: Any real example can be mapped to some point in the latent space and reconstructed from it.
2. Realism: Any point in this space represents some realistic example, including ones not in the training set.
3. Smoothness: Examples from nearby points in latent space have similar qualities to one another.
Sources: https://experiments.withgoogle.com/ai/beat-blender/view/; https://magenta.tensorflow.org/music-vae; Roberts, Adam, et al. "A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music." arXiv preprint arXiv:1803.05428 (2018).

Key: Design Latent Space Properties
[Figure: latent space with “holes” :(]
Learn smooth latent state representations of the input data. Good for interpolation, sampling, generation, downstream classification, etc.
Source: https://www.jeremyjordan.me/variational-autoencoders/

Interpolation → Smooth Transformation
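Interpolation in latent space amounts to decoding points along a path between two codes. A minimal sketch of the path itself, assuming straight-line interpolation and two made-up 3-dimensional codes `z_a` and `z_b` (the decoder is not shown):

```python
import numpy as np

def interpolate(z_start, z_end, num_steps):
    """Evenly spaced points on the straight line from z_start to z_end."""
    t = np.linspace(0.0, 1.0, num_steps)[:, None]
    return (1.0 - t) * z_start + t * z_end

z_a = np.array([0.0, 2.0, -1.0])     # hypothetical code of example A
z_b = np.array([4.0, 0.0, 1.0])      # hypothetical code of example B
path = interpolate(z_a, z_b, num_steps=5)
# Each row of `path` would be passed to the decoder to render one frame.
```

If the latent space is smooth and free of holes, consecutive decoded frames should change gradually from example A to example B.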
Source: https://www.jeremyjordan.me/variational-autoencoders/

SketchRNN: Seq2seq + Variational Autoencoder
Arithmetic operations on sketch embeddings!
Smoothness of latent space
- Sequence-to-sequence (seq2seq) autoencoder framework with variational inference
- Sketch → sequence of motor actions controlling a pen (how about text or graph as a sequence?)
- By adding noise to the latent vector, the model cannot reproduce the input sketch exactly
Source: https://research.googleblog.com/2017/04/teaching-machines-to-draw.html

Adversarial Autoencoders (AAE) (2015)

AAE: Regularized by An Adversarial Network Which Guides Posterior q(z|x) to Match Any Arbitrary Prior p(z)
[Figure: VAE, with prior p(z) and an additional $D_{KL}$ loss term, compared with AAE]
Source: Makhzani, Alireza, et al. "Adversarial autoencoders." arXiv preprint arXiv:1511.05644 (2015).

AAE: Design Arbitrary Prior
[Figure]

AAE: Labels Can Further Guide (Semi-Supervised)
[Figure]
Source (for both figures): Makhzani, Alireza, et al. (2015), cited above.

Wasserstein Autoencoders (WAE) (2017)

WAE: Motivation
VAE
Pros:
1. Theoretically elegant
2. Stable training
3. Encoder-decoder architecture
4. Nice latent manifold structure
Cons:
1. Tend to generate blurry samples

GAN
Pros:
1. Good visual quality of images
Cons:
1. Harder to train
2. No encoder; only a decoder/generator and a discriminator
3. “Mode collapse” problem
4. ~JS divergence, “worse” than Wasserstein distance (see details in the paper)
Source: Tolstikhin, Ilya, et al. "Wasserstein Auto-Encoders." arXiv preprint arXiv:1711.01558 (2017).

Combine VAE + GAN in A Principled Way?
[Figure: VAE (encoder, decoder, prior p(z), additional $D_{KL}$ loss term) next to GAN (decoder/generator, discriminator)]
Source: Tolstikhin, Ilya, et al. (2017), cited above.

WAE
A generalization of AAE; minimizes Wasserstein distance between the model and the target distribution.
[Figure label: AAE]
Sources: Tolstikhin, Ilya, et al. "Wasserstein Auto-Encoders." arXiv preprint arXiv:1711.01558 (2017); https://openreview.net/forum?id=HkL7n1-0b

Autoencoders for Graphs

Graphs Are Different
1. Are there smooth linear interpolations? Arithmetic operations?
2. Graph is composed of correlated substructures
   a. E.g. Two triangles → rectangle
   b. Hierarchy: pixels (atomic) → patterns → images; words (atomic) → phrases → sentences → paragraphs/documents; nodes (atomic) → substructures → graphs (transfer learning)
3. Graphs are of different sizes
4. Graph nodes lack order
5. How to detect substructures?
   a. For image, convolutional layers → SCAE
   b. For graph, graph convolutional layers → node/substructure/graph?
   c. Some people treat graph as sequences/random walks → “deconstruction” view
      i. ~Parse sentences into trees instead of feeding into LSTM
   d. How about decomposing graphs into equal-size subgraphs?

GraphVAE
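Point 4 under Graphs Are Different (graph nodes lack order) matters for any model that reconstructs a graph, because the same graph has many adjacency matrices. The sketch below relabels the nodes of a small path graph and shows that the matrices differ while an order-independent summary does not. The graph, the permutation, and the degree-sequence summary are illustrative choices, not part of GraphVAE itself.

```python
import numpy as np

# A 4-node path graph 0-1-2-3 as an adjacency matrix.
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])

perm = np.array([2, 0, 3, 1])            # relabel the nodes
P = np.eye(4, dtype=int)[perm]           # permutation matrix
A_perm = P @ A @ P.T                     # same graph, different node order

matrices_differ = not np.array_equal(A, A_perm)

def sorted_degrees(adj):
    """An order-independent summary: the sorted degree sequence."""
    return np.sort(adj.sum(axis=1))

summaries_match = np.array_equal(sorted_degrees(A), sorted_degrees(A_perm))
```

An entry-by-entry reconstruction loss between `A` and `A_perm` would be nonzero even though the two matrices describe the same graph, so a graph autoencoder needs either an order-invariant loss or some way to align the nodes before comparing.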
Source: Simonovsky, Martin, and Nikos Komodakis. "GraphVAE: Towards Generation of Small Graphs Using Variational Autoencoders." arXiv preprint arXiv:1802.03480 (2018).

Thank you!