
Unsupervised Learning (Yunsheng Bai)

Roadmap

1. Introduction to Autoencoders
2. Sparse Autoencoders (SAE) (2008)
3. Denoising Autoencoders (DAE) (2008)
4. Contractive Autoencoders (CAE) (2011)
5. Stacked Convolutional Autoencoders (SCAE) (2011)
6. Recursive Autoencoders (RAE) (2011)
7. Variational Autoencoders (VAE) (2013)
8. Adversarial Autoencoders (AAE) (2015)
9. Wasserstein Autoencoders (WAE) (2017)
10. Autoencoders for Graphs

Introduction to Autoencoders

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

https://en.wikipedia.org/wiki/Principal_component_analysis#/media/File:GaussianScatterPCA.svg https://www.cs.toronto.edu/~urtasun/courses/CSC411/14_pca.pdf Change of basis

The new coordinate of a data point along a basis vector is the inner product between them.
https://www.cs.toronto.edu/~urtasun/courses/CSC411/14_pca.pdf

PCA ≈ Autoencoder with Linear Activation Function

(Unlike the PCA basis, the weights learned by the autoencoder are not necessarily orthogonal.)

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
https://www.cs.toronto.edu/~urtasun/courses/CSC411/14_pca.pdf

Could have many layers, but as long as the activations are linear, they collapse into a single W and a single V
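As a quick illustration (not from the slides), here is a minimal sketch of such a linear autoencoder; the toy data, layer sizes, and training settings are assumptions. Trained with an MSE loss, it ends up spanning the same subspace as the top principal components, although its axes need not be orthogonal or ordered like PCA's.

```python
# Minimal sketch (assumed toy data and sizes): a purely linear autoencoder
# trained with MSE recovers the top principal subspace of the data.
import numpy as np
from tensorflow import keras

X = np.random.randn(1000, 10).astype("float32")
X -= X.mean(axis=0)                                 # center the data, as PCA does

encoder = keras.layers.Dense(2, use_bias=False)     # W: 10 -> 2, linear activation
decoder = keras.layers.Dense(10, use_bias=False)    # V: 2 -> 10, linear activation
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=100, batch_size=32, verbose=0)

codes = encoder(X)   # 2-D codes spanning roughly the same subspace as the top-2 PCs
```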

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
https://www.cs.toronto.edu/~urtasun/courses/CSC411/14_pca.pdf

PCA vs Autoencoder

— autoencoders are much more flexible than PCA.
— NN activation functions introduce "non-linearities" in the encoding, but PCA only does a linear transformation.
— we can stack autoencoders to form a deep autoencoder network (figure: 1-layer, 2-layer, 3-layer, 4-layer / stacked variants)

https://towardsdatascience.com/autoencoders-are-essential-in-deep-neural-nets-f0365b2d1d7c
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python

Goal: Learn Useful Features from Data

We’ve seen that autoencoders can do PCA, but fundamentally, why does an autoencoder work?

https://hackernoon.com/autoencoders-deep-learning-bits-1-11731e200694 Goal: Feature/Representation Learning

Why can’t an autoencoder simply copy input to output through identity functions?

(Figure: an encoder f and a decoder g trained to minimize ||x − g(f(x))||². If the hidden layer is overcomplete, i.e. at least as large as the input, the network can simply learn an identity mapping and copy x through.)

Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python

To Achieve Conflicting Goals

Autoencoders are designed to be unable to learn to copy perfectly. Usually they are restricted in ways that allow them to copy only approximately. Because the model is forced to prioritize which aspects of the input should be copied, it often learns useful properties of the data.

Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville)

"If you could speak only a few words per month, you would probably try to make them worth listening to."

Undercomplete Autoencoders

Even with a small code h, the encoder and decoder can be too powerful :(

Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python http://rgraphgallery.blogspot.com/2013/04/rg-3d-scatter-plots-with-vertical-lines.html Regularized Autoencoders

Regularized autoencoders use a loss function that encourages the model to have other properties besides the ability to copy its input to its output. These other properties include sparsity of the representation, smallness of the derivative of the representation, and robustness to noise or to missing inputs. A regularized autoencoder can be nonlinear and overcomplete but still learn something useful about the data distribution, even if the model capacity is great enough to learn a trivial identity function.

→ introduce new things to the loss
→ they are just different regularizers

2008: Sparse Autoencoders (SAE)
2008: Denoising Autoencoders (DAE)
2011: Contractive Autoencoders (CAE)
2011: Stacked Convolutional Autoencoders (SCAE)
2011: Recursive Autoencoders (RAE)
2013: Variational Autoencoders (VAE)
2015: Adversarial Autoencoders (AAE)
2017: Wasserstein Autoencoders (WAE)

Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville) Properties of Autoencoders (Ideally)

1. Learn useful features from data (effective representations)
   a. Capture the intrinsic properties of data → feed them into downstream applications
   b. Can be thought of as patterns in data → generate new data
2. Produce low-dimensional vectors (efficient/compact representations)
   a. Efficient for storage
   b. Efficient for downstream models
   c. May be free of noise in the input
   d. Easier to visualize than high-dimensional data

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems Properties of Autoencoders (Ideally)

3. Are flexible: can be modified/guided/regularized in various ways:
   a. Input data, e.g. add noise
   b. Output data, e.g. something different from the input
   c. Architecture, e.g. fully connected layer → convolutional layer
   d. Loss, e.g. add additional loss terms → capture other useful information from the input
   e. Latent space, e.g. Gaussian (more later in VAE)
      i. Enforce certain prior knowledge, usually through additional loss terms
      ii. Analyzing the latent space/representations is a trend (?), e.g. debiasing word embeddings
   f. … (Be creative! This is where research comes from)

History of Autoencoders

10 years ago, we thought that deep nets would also need an unsupervised cost, like the autoencoder cost, to regularize them.

Today, we know we are able to recognize images just by using backprop on the supervised cost as long as there is enough labeled data.

(Humans can learn from very few labeled examples. Why? One popular hypothesis: the brain can leverage unsupervised or semi-supervised learning.)

There are other tasks where we do still use autoencoders, but they’re not the fundamental solution to training deep nets that people once thought they were going to be.

(Ian Goodfellow, 2016) https://www.quora.com/Why-are-autoencoders-considered-a-failure-What-are-their-alternatives https://www.doc.ic.ac.uk/~js4416/163/website/nlp/#XGlorot2011 Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville) Applications of Autoencoders

1. Data Compression for Storage
   a. Difficult to train an autoencoder better than a basic compression algorithm like JPEG
   b. Autoencoders are data-specific: may be hard to generalize to unseen data
2. Dimensionality Reduction for Data Visualization
   a. t-SNE is good, but typically requires relatively low-dimensional data
      i. For high-dimensional data, first use an autoencoder, then use t-SNE
   b. Latent space visualization (more later)

https://blog.keras.io/building-autoencoders-in-keras.html https://www.doc.ic.ac.uk/~js4416/163/website/nlp/#XVincent2008 https://hackernoon.com/latent-space-visualization-deep-learning-bits-2-bd09a46920df Applications of Autoencoders

3. Unsupervised Pretraining
   a. Greedy Layer-Wise Unsupervised Pretraining: train each layer of a feedforward net greedily; continue stacking layers; the output of prior layers is the input for the next one; fine-tune
   b. Today, we have random weight initialization, rectified linear units (ReLUs) (2011), dropout (2012), (2014), residual learning (2015) + large labeled datasets
   c. Still useful:
      i. Train a deep autoencoder
      ii. Train an autoencoder on an unlabeled dataset, and reuse the lower layers to create a new network trained on the labeled data (~supervised pretraining; see the sketch below)
      iii. Train an autoencoder on an unlabeled dataset, and use the learned representations in downstream tasks (see more in 4)
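A hedged sketch of option ii above (layer sizes, the 784-dimensional flattened input, and the commented-out datasets are illustrative assumptions, not the book's code):

```python
# Sketch: train an autoencoder on unlabeled data, then reuse its encoder layers
# as the lower layers of a supervised classifier trained on a small labeled set.
from tensorflow import keras

encoder = keras.Sequential([
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(32, activation="relu"),       # learned representation
])
decoder = keras.Sequential([
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(784, activation="sigmoid"),   # reconstruct the input
])

autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_unlabeled, x_unlabeled, epochs=10)   # plenty of unlabeled data

# Reuse the pretrained encoder: train only the classification head at first,
# then optionally unfreeze and fine-tune end to end on the labeled data.
encoder.trainable = False
classifier = keras.Sequential([encoder, keras.layers.Dense(10, activation="softmax")])
classifier.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# classifier.fit(x_labeled, y_labeled, epochs=10)
```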

https://blog.keras.io/building-autoencoders-in-keras.html https://www.doc.ic.ac.uk/~js4416/163/website/nlp/#XVincent2008 Greedy Layer-Wise Unsupervised Pretraining for Training Deep Autoencoders

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems Unsupervised Pretraining for Supervised Tasks

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

downside: two-stage training → more hyperparameter tuning :(

Unsupervised Pretraining for Supervised Tasks

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems Supervised Pretraining

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems Multi-Task Learning

Transfer Learning Domain Adaptation

supervised pretraining

https://www.youtube.com/watch?v=R3DNKE3zKFk Multi-Task Learning

https://www.youtube.com/watch?v=R3DNKE3zKFk Applications of Autoencoders

4. Generate Representations for Downstream Tasks
   a. Special case of unsupervised pretraining (3.c.iii)
   b. Useful when the initial representation is poor and there is a lot of unlabeled data
      i. Word embeddings (better than one-hot representations)
      ii. Graph node embeddings
      iii. Image embeddings (images already lie in a rich vector space? Check out puppy image embeddings!)
      iv. Semantic hashing: turn entries (text, images, etc.) into low-dimensional, binary codes → information retrieval
   c. Question: if there are labels, is there any reason to use a decoder with a reconstruction loss?
5. Generate New Data (Generative Model)
   a. Especially Variational Autoencoders (VAE) and Adversarial Autoencoders (AAE) (more later)
   b. Creative applications (more later)

Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville) graph node embedding

(Figure: copy the output of Layer 2 (the embedding) and use it, i.e. the Hidden 2 activations, as input to a downstream model such as an SVM or another classifier.)

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

semantic hashing

(Figure: copy the output of Layer 2 (the code) for the query and for every entry in the database, then compare the codes (hidden representations) to retrieve similar items.)

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Applications of Autoencoders

6. Self-supervised Learning
   a. ∈ supervised learning where the targets are generated from the input data
   b. Merely learning to reconstruct the input might not be enough to learn abstract features of the kind that label-supervised learning induces (where targets are "dog", "car", ...)
      i. Data denoising
      ii. Jigsaw puzzle solver
      iii. ...

https://blog.keras.io/building-autoencoders-in-keras.html https://www.doc.ic.ac.uk/~js4416/163/website/nlp/#XVincent2008

Skipgram vs Autoencoders

1. In NLP word embeddings, why is Skipgram more popular than autoencoders?
   a. Simpler
   b. More efficient
   c. Works well already
2. When does Skipgram no longer suffice? When there are additional goals, e.g.
   a. Denoising
   b. Complex characteristics of word use + polysemy → use a bidirectional LSTM with attention as the encoder!
   c. Generative setting (generate new data)
   d. Inductive setting (embed unseen words)
3. Can Skipgram be viewed as a special case of some autoencoder model?
   a. In fact, encoding and decoding are very general concepts and are used in many places

Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013). Peters, Matthew E., et al. "Deep contextualized word representations." arXiv preprint arXiv:1802.05365 (2018). Roadmap

Sparse Autoencoders (SAE) (2008)

An image should be represented by only a few bases.

Motivation 1: Sparse Coding

https://www.cs.ubc.ca/~schmidtm/MLRG/sparseCoding.pdf A document should be about only a few topics. Motivation 1: Sparse Coding

https://www.cs.ubc.ca/~schmidtm/MLRG/sparseCoding.pdf Motivation 1: Sparse Coding

https://www.cs.ubc.ca/~schmidtm/MLRG/sparseCoding.pdf “If you could speak only a few words per month, you would probably try to make them worth listening to.” Motivation 1: Sparse Coding Change of basis + Sparsity constraint

Encode (change of basis): h = D^T x
Decode: x = D h
→ D D^T x = x → D D^T = I
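For reference, the standard sparse coding objective behind these slides combines the reconstruction term with an L1 sparsity constraint on the codes (here x_i are the data points, h_i their sparse codes, D the dictionary, and λ the sparsity weight; this is the standard formulation, not copied from the slide):

```latex
\min_{D,\; h_1, \dots, h_n} \;\; \sum_{i=1}^{n} \big\| x_i - D h_i \big\|_2^2 \;+\; \lambda \sum_{i=1}^{n} \| h_i \|_1
```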

https://www.cs.ubc.ca/~schmidtm/MLRG/sparseCoding.pdf Recall PCA Change of basis

Inner product between them https://www.cs.toronto.edu/~urtasun/courses/CSC411/14_pca.pdf Motivation 2: Prevent Identity Transform

(Figure: encoding h = W^T x, i.e. f(x) = h; decoding x = W h, i.e. g(h) = x; so W W^T = I, which is fine on its own, but an overcomplete network can satisfy this with W ≈ I, a trivial identity transform that simply copies the input.)

Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python Motivation 2: Prevent Identity Transform

In the case of an image, we can think of W as a set of filters (each with the same size as the input, e.g. 4x4). The reconstruction is a weighted combination of these filters (e.g. 0.3 · w1 + 0.2 · w2 + 0.1 · w3 + ...) that should be the same as the input, i.e. x.

https://www.cs.ubc.ca/~schmidtm/MLRG/sparseCoding.pdf

Motivation 2: Prevent Identity Transform

(Figure: the input is reconstructed as a sum of a few sparse filters, each mostly zeros.)

https://www.cs.ubc.ca/~schmidtm/MLRG/sparseCoding.pdf

activation of hidden unit j of layer 2 (assume two layers in the encoder)

Sparse Autoencoders

(Loss = reconstruction loss + regularization term + sparsity penalty, averaged over the # of training samples; f is the encoder and g the decoder.)
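A hedged sketch of the sparsity penalty in this loss, following the KL-divergence formulation in the Stanford notes cited on this slide (the target sparsity value and the function name are assumptions):

```python
# Sketch of the KL-divergence sparsity penalty for a sparse autoencoder.
import tensorflow as tf

def sparsity_penalty(hidden_activations, rho=0.05):
    """KL(rho || rho_hat_j), summed over hidden units j.

    hidden_activations: (batch, n_hidden) sigmoid activations in (0, 1).
    rho_hat_j is the mean activation of unit j over the training batch.
    """
    rho_hat = tf.reduce_mean(hidden_activations, axis=0)
    kl = rho * tf.math.log(rho / rho_hat) + \
         (1.0 - rho) * tf.math.log((1.0 - rho) / (1.0 - rho_hat))
    return tf.reduce_sum(kl)

# Total loss per batch (beta and lam are hyperparameters):
#   loss = reconstruction_mse + lam * l2_weight_decay + beta * sparsity_penalty(h)
```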

This results in sparse activation of hidden units across training points, but does not guarantee that each input has a sparse representation. (Makhzani, Alireza, and Brendan Frey. "K-sparse autoencoders." arXiv preprint arXiv:1312.5663 (2013).)

Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
http://web.stanford.edu/class/cs294a/sparseAutoencoder.pdf

Results

Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python Techniques to Interpret Autoencoders

1. Visualize the weight matrix W
   a. Each column of W corresponds to the weights of a particular neuron
   b. When there is a natural interpretation of the weights, we can visualize them
      i. Especially true in the case of images, as seen previously (~convolution filters)
      ii. Especially true for the top hidden layers, since they often capture relatively large features
2. Visualize the most exciting input per neuron (see the sketch below)
   a. Treat each neuron as a feature detector. To find the feature a particular neuron is looking for:
      i. Feed a random input
      ii. Measure the activation of the neuron you are interested in
      iii. Perform backpropagation to tweak the input so that the neuron will activate even more (gradient ascent)
      iv. Iterate several times
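A hedged sketch of technique 2 (gradient ascent on the input); the `encoder` model, neuron index, step size, and step count are assumptions:

```python
# Find the input that most excites one hidden neuron via gradient ascent.
import tensorflow as tf

def most_exciting_input(encoder, neuron, input_shape, steps=100, lr=0.1):
    """Start from a random input and tweak it so that a chosen hidden neuron
    activates more and more (gradient ascent on the input)."""
    x = tf.Variable(tf.random.normal([1, *input_shape]))
    for _ in range(steps):
        with tf.GradientTape() as tape:
            activation = encoder(x)[0, neuron]
        grad = tape.gradient(activation, x)
        x.assign_add(lr * grad)          # ascend: increase the activation
    return x.numpy()
```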

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems Roadmap

Denoising Autoencoders (DAE) (2008)

Sparse Coding Could Also Handle Image Denoising

Key: the use of sparse and redundant representations over trained dictionaries.

https://www.cs.ubc.ca/~schmidtm/MLRG/sparseCoding.pdf
Elad, Michael, and Michal Aharon. "Image denoising via learned dictionaries and sparse representation." Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on. Vol. 1. IEEE, 2006.

Denoising Autoencoders: Implementation-level
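Since the implementation figure does not survive in this text version, here is a hedged sketch of the usual DAE training setup: corrupt the input, but reconstruct the clean target. The architecture, noise level, and the 784-dimensional input are illustrative assumptions.

```python
# Sketch of denoising-autoencoder training: noisy input, clean target.
import tensorflow as tf
from tensorflow import keras

autoencoder = keras.Sequential([
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(784, activation="sigmoid"),
])
autoencoder.compile(optimizer="adam", loss="mse")

# x_train: clean inputs in [0, 1]; corrupt with Gaussian noise (could also use
# salt-and-pepper noise or dropout on the inputs, as in the results that follow).
# x_noisy = x_train + 0.3 * tf.random.normal(tf.shape(x_train))
# x_noisy = tf.clip_by_value(x_noisy, 0.0, 1.0)
# autoencoder.fit(x_noisy, x_train, epochs=10)   # target is the CLEAN input
```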

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems Denoising Autoencoders: Results

Gaussian noise

Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python Denoising Autoencoders: Results

Salt and pepper noise

Pro Deep Learning with TensorFlow: A Mathematical Approach to Advanced Artificial Intelligence in Python Denoising Autoencoders: Research-level

Why is the DAE training criterion equivalent to a reconstruction loss?

(1) Intuitively. (2) Recall that the least-squares estimate is the same as the maximum-likelihood estimate under a Gaussian model.

Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville) Roadmap

Contractive Autoencoders (CAE) (2011)

CAE: Resist Infinitesimal Perturbations of Input

All autoencoder training procedures involve a compromise between two opposing forces: being data-specific and being data-insensitive.
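For concreteness, the contractive penalty (Rifai et al., 2011) adds the squared Frobenius norm of the encoder's Jacobian to the reconstruction loss; below, f is the encoder, g the decoder, and λ the penalty weight (this standard formula is not spelled out on the slide):

```latex
L_{\mathrm{CAE}}(x) \;=\; \big\| x - g(f(x)) \big\|^2 \;+\; \lambda \,\Big\| \frac{\partial f(x)}{\partial x} \Big\|_F^2
```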

CAE and DAE are equivalent under certain conditions.

Deep Learning (Adaptive Computation and Machine Learning series) (Ian Goodfellow, Yoshua Bengio, Aaron Courville) Roadmap

Stacked Convolutional Autoencoders (SCAE) (2011)

SCAE

Use convolutional + pooling layers instead of fully connected layers.
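A hedged sketch of such a convolutional autoencoder (layer sizes are illustrative, e.g. for 28x28x1 images; this is not the exact SCAE architecture from the paper):

```python
# Sketch: convolutional encoder + pooling, transposed-convolution decoder.
from tensorflow import keras

encoder = keras.Sequential([
    keras.layers.Conv2D(16, 3, activation="relu", padding="same"),
    keras.layers.MaxPooling2D(2),                       # 28x28 -> 14x14
    keras.layers.Conv2D(8, 3, activation="relu", padding="same"),
    keras.layers.MaxPooling2D(2),                       # 14x14 -> 7x7
])
decoder = keras.Sequential([
    keras.layers.Conv2DTranspose(8, 3, strides=2, activation="relu", padding="same"),
    keras.layers.Conv2DTranspose(16, 3, strides=2, activation="relu", padding="same"),
    keras.layers.Conv2D(1, 3, activation="sigmoid", padding="same"),
])
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
# autoencoder.fit(x_train, x_train, epochs=10)
```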

Dong, Chao, et al. "Image super-resolution using deep convolutional networks." IEEE transactions on pattern analysis and machine intelligence 38.2 (2016): 295-307. Image Deblurring/Denoising/Super-Resolution and Image Colorization

Dong, Chao, et al. "Image super-resolution using deep convolutional networks." IEEE transactions on pattern analysis and machine intelligence 38.2 (2016): 295-307. https://hackernoon.com/autoencoders-deep-learning-bits-1-11731e200694 Roadmap

Recursive Autoencoders (RAE) (2011)

Sentence Representation

Why not simple average? “white blood cells destroying an infection” ≠ “an infection destroying white blood cells”

Socher, Richard, et al. "Semi-supervised recursive autoencoders for predicting sentiment distributions." Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011.

Sentence Representation

Could penalize top-level nodes more heavily, which contain more children
Could predict all children underneath → unfolding RAE
Could introduce a supervised loss
Could normalize the hidden representations
Could use many layers
Could use a parse tree
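A hedged NumPy sketch of the basic RAE merge step from Socher et al. (2011); weight shapes, the tanh nonlinearity, and the names are assumptions, and bias terms are omitted:

```python
# One recursive-autoencoder merge: encode two child vectors into a parent,
# score the merge by how well the parent reconstructs its children.
import numpy as np

d = 50                                   # embedding dimension
W_e = np.random.randn(d, 2 * d) * 0.01   # encoder weights
W_d = np.random.randn(2 * d, d) * 0.01   # decoder weights

def merge(c1, c2):
    children = np.concatenate([c1, c2])          # [c1; c2], shape (2d,)
    parent = np.tanh(W_e @ children)             # p = tanh(W_e [c1; c2])
    reconstruction = np.tanh(W_d @ parent)       # [c1'; c2']
    loss = np.sum((children - reconstruction) ** 2)
    return parent, loss

# Applied bottom-up over a (greedy or parse-tree) binary tree, the root's
# parent vector serves as the sentence representation.
```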

https://www.doc.ic.ac.uk/~js4416/163/website/nlp/recursive.html Socher, Richard, et al. "Dynamic pooling and unfolding recursive autoencoders for paraphrase detection." Advances in neural information processing systems. 2011. Roadmap

Variational Autoencoders (VAE) (2013)

Encoder outputs statistical distributions; feed samples into the decoder → add noise at all times; generate new data after training

VAE: Intuition

https://www.jeremyjordan.me/variational-autoencoders/ VAE: Implementation-level

Assume the prior distribution of z, i.e. p(z), to be Gaussian → encourage the learned posterior q(z|x) to be similar to p(z) through an additional loss term measuring their KL divergence.

Probabilistic (outputs produced partly by chance, even after training) + generative autoencoders
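A hedged sketch of the pieces described above (sizes and names are assumptions): the encoder outputs a mean and log-variance, z is sampled with the reparameterization trick, and a KL term pushes q(z|x) toward the Gaussian prior p(z).

```python
import tensorflow as tf
from tensorflow import keras

latent_dim, input_dim = 2, 784

class Sampling(keras.layers.Layer):
    """z = mu + sigma * eps, eps ~ N(0, I); also adds the KL loss term."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        kl = -0.5 * tf.reduce_mean(tf.reduce_sum(
            1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1))
        self.add_loss(kl)                              # KL(q(z|x) || N(0, I))
        eps = tf.random.normal(tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

inputs = keras.Input(shape=(input_dim,))
h = keras.layers.Dense(128, activation="relu")(inputs)
z_mean = keras.layers.Dense(latent_dim)(h)
z_log_var = keras.layers.Dense(latent_dim)(h)
z = Sampling()([z_mean, z_log_var])
h_dec = keras.layers.Dense(128, activation="relu")(z)
outputs = keras.layers.Dense(input_dim, activation="sigmoid")(h_dec)

vae = keras.Model(inputs, outputs)
vae.compile(optimizer="adam", loss="binary_crossentropy")
# vae.fit(x_train, x_train, epochs=10)   # reconstruction + KL are minimized together
```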

Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems Variational Bayesian Inference VAE: Research-level

(Figure: the probabilistic decoder p(x|z) is the generative model; the probabilistic encoder q(z|x) is the recognition model, a variational approximation to the intractable true posterior; z is the latent representation or code.)

Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." arXiv preprint arXiv:1312.6114 (2013). https://www.jeremyjordan.me/variational-autoencoders/ MusicVAE: Generative Model → Creative Artists

The desirable properties of a latent space can be summarized as follows:

1. Expression: Any real example can be mapped to some point in the latent space and reconstructed from it.
2. Realism: Any point in this space represents some realistic example, including ones not in the training set.
3. Smoothness: Examples from nearby points in latent space have similar qualities to one another.

https://experiments.withgoogle.com/ai/beat-blender/view/ https://magenta.tensorflow.org/music-vae Roberts, Adam, et al. "A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music." arXiv preprint arXiv:1803.05428 (2018). Key: Design Latent Space Properties

“holes” :(

Learn smooth latent state representations of the input data. Good for interpolation, sampling, generation, downstream classification, etc.
https://www.jeremyjordan.me/variational-autoencoders/

Interpolation → Smooth Transformation
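A hedged sketch of latent-space interpolation (assumes trained `encoder`/`decoder` models where the encoder returns a single code per input; the step count is arbitrary):

```python
# Decode points on the straight line between the codes of two inputs.
import numpy as np

def interpolate(encoder, decoder, x_a, x_b, steps=10):
    z_a, z_b = encoder(x_a[None]), encoder(x_b[None])
    return [decoder((1 - t) * z_a + t * z_b)
            for t in np.linspace(0.0, 1.0, steps)]
```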

https://www.jeremyjordan.me/variational-autoencoders/ SketchRNN: Seq2seq + VAE

Arithmetic operations on sketch embeddings!

Smoothness of latent space

sequence-to-sequence (seq2seq) autoencoder framework with variational inference
sketch → sequence of motor actions controlling a pen (how about text or a graph as a sequence?)
by adding noise to the latent vector, the model cannot reproduce the input sketch exactly
https://research.googleblog.com/2017/04/teaching-machines-to-draw.html

Roadmap

Adversarial Autoencoders (AAE) (2015)

AAE: Regularized by an Adversarial Network, Which Guides the Posterior q(z|x) to Match Any Arbitrary Prior p(z)

(Figure: in a VAE, an additional D_KL loss term matches q(z|x) to the prior p(z); in an AAE, an adversarial network plays this role instead.)
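A hedged sketch of the adversarial regularization phase (the encoder/discriminator models, optimizers, and the Gaussian prior are assumptions; the reconstruction phase of the autoencoder is trained separately, as in the paper):

```python
# AAE idea: a discriminator tells prior samples p(z) apart from encoder outputs
# q(z|x); the encoder is also trained to fool it, pushing q(z) toward p(z).
import tensorflow as tf
from tensorflow import keras

bce = keras.losses.BinaryCrossentropy(from_logits=True)

def aae_regularization_step(encoder, discriminator, x, d_opt, e_opt):
    z_fake = encoder(x)                            # samples from q(z|x)
    z_real = tf.random.normal(tf.shape(z_fake))    # samples from the prior p(z)

    # 1) Train the discriminator to distinguish prior samples from codes.
    with tf.GradientTape() as tape:
        d_real, d_fake = discriminator(z_real), discriminator(z_fake)
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
    grads = tape.gradient(d_loss, discriminator.trainable_variables)
    d_opt.apply_gradients(zip(grads, discriminator.trainable_variables))

    # 2) Train the encoder (as generator) to fool the discriminator.
    with tf.GradientTape() as tape:
        d_out = discriminator(encoder(x))
        g_loss = bce(tf.ones_like(d_out), d_out)
    grads = tape.gradient(g_loss, encoder.trainable_variables)
    e_opt.apply_gradients(zip(grads, encoder.trainable_variables))
```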

Makhzani, Alireza, et al. "Adversarial autoencoders." arXiv preprint arXiv:1511.05644 (2015). AAE: Design Arbitrary Prior

Makhzani, Alireza, et al. "Adversarial autoencoders." arXiv preprint arXiv:1511.05644 (2015). AAE: Labels Can Further Guide (Semi-Supervised)

Makhzani, Alireza, et al. "Adversarial autoencoders." arXiv preprint arXiv:1511.05644 (2015). Roadmap

Wasserstein Autoencoders (WAE) (2017)

WAE: Motivation

VAE
Pros:
1. Theoretically elegant
2. Stable training
3. Encoder-decoder architecture
4. Nice latent manifold structure
Cons:
1. Tend to generate blurry samples

GAN
Pros:
1. Good visual quality of images
Cons:
1. Harder to train
2. No encoder; only a decoder/generator and a discriminator
3. "Mode collapse" problem
4. ~JS divergence, "worse" than the Wasserstein distance (see details in the paper)

Tolstikhin, Ilya, et al. "Wasserstein Auto-Encoders." arXiv preprint arXiv:1711.01558 (2017). Combine VAE + GAN in A Principled Way?

(Figure: the VAE has an encoder and a decoder, with an additional D_KL loss term against the prior p(z); the GAN has only a decoder/generator and a discriminator.)

Tolstikhin, Ilya, et al. "Wasserstein Auto-Encoders." arXiv preprint arXiv:1711.01558 (2017). WAE

A generalization of AAE; minimizes Wasserstein distance between the model and the target distribution.


Tolstikhin, Ilya, et al. "Wasserstein Auto-Encoders." arXiv preprint arXiv:1711.01558 (2017). https://openreview.net/forum?id=HkL7n1-0b Roadmap

Autoencoders for Graphs

Graphs Are Different

1. Are there smooth linear interpolations? Arithmetic operations?
2. A graph is composed of correlated substructures
   a. E.g. two triangles → a rectangle
   b. Hierarchy: pixels (atomic) → patterns → images; words (atomic) → phrases → sentences → paragraphs/documents; nodes (atomic) → substructures → graphs (transfer learning)
3. Graphs are of different sizes
4. Graph nodes lack order
5. How to detect substructures?
   a. For images, convolutional layers → SCAE
   b. For graphs, graph convolutional layers → node/substructure/graph?
   c. Some people treat graphs as sequences/random walks → "deconstruction" view
      i. ~Parse sentences into trees instead of feeding them into an LSTM
   d. How about decomposing graphs into equal-size subgraphs?

GraphVAE

Simonovsky, Martin, and Nikos Komodakis. "GraphVAE: Towards Generation of Small Graphs Using Variational Autoencoders." arXiv preprint arXiv:1802.03480 (2018).

Thank you!