CSCE 479/879 Lecture 5: Autoencoders

Stephen Scott
(Adapted from Eleanor Quint and Ian Goodfellow)
[email protected]


Introduction

Autoencoding is training a network to replicate its input to its output.

Applications:
- Unlabeled pre-training for semi-supervised learning
- Learning embeddings to support information retrieval
- Generation of new instances similar to those in the training set
- Data compression


Outline

- Basic idea
- Stacking
- Types of autoencoders
  - Denoising
  - Sparse
  - Contractive
  - Variational
- Generative adversarial networks


Basic Idea (Mitchell, 1997)

[Figure: autoencoder example from Mitchell (1997)]

Sigmoid activation functions, 5000 training epochs, square loss, no regularization.

What's special about the hidden outputs?


Basic Idea

An autoencoder is a network trained to learn the identity function: output = input.

- Subnetwork called the encoder f(·) maps the input to an embedded representation
- Subnetwork called the decoder g(·) maps back to the input space
- Can be thought of as lossy compression of the input
- Need to identify the important attributes of the inputs to reproduce them faithfully


Basic Idea

General types of autoencoders, based on the size of the hidden layer:

- Undercomplete autoencoders have a hidden layer smaller than the input layer
  ⇒ Dimension of the embedded space is lower than that of the input space
  ⇒ Cannot simply memorize training instances
- Overcomplete autoencoders have much larger hidden layer sizes
  ⇒ Regularize to avoid overfitting, e.g., enforce a sparsity constraint


Basic Idea, Example: Principal Component Analysis

A 3-2-3 autoencoder with linear units and square loss performs principal component analysis: it finds the linear transformation of the data that maximizes variance.
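
As a rough sketch of this example (not from the slides), the following TensorFlow 1.x-style code builds a 3-2-3 linear autoencoder trained with square loss; the learning rate, epoch count, and toy data are illustrative assumptions.

# Hypothetical sketch of a 3-2-3 linear autoencoder (TF 1.x-style API)
import numpy as np
import tensorflow as tf

n_inputs, n_hidden = 3, 2                                       # 3-2-3 architecture
X = tf.placeholder(tf.float32, shape=[None, n_inputs])
codings = tf.layers.dense(X, n_hidden, activation=None)         # linear encoder
outputs = tf.layers.dense(codings, n_inputs, activation=None)   # linear decoder
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))    # square loss
training_op = tf.train.AdamOptimizer(0.01).minimize(reconstruction_loss)

X_train = np.random.randn(200, n_inputs).astype(np.float32)     # toy data
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(1000):
        sess.run(training_op, feed_dict={X: X_train})
    # The 2-D codings span the same subspace as the top two principal components
    codings_val = sess.run(codings, feed_dict={X: X_train})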

Stacked Autoencoders

A stacked autoencoder has multiple hidden layers.

Can share parameters to reduce their number by exploiting symmetry: W4 = W1^T and W3 = W2^T

weights1 = tf.Variable(weights1_init, dtype=tf.float32, name="weights1")
weights2 = tf.Variable(weights2_init, dtype=tf.float32, name="weights2")
weights3 = tf.transpose(weights2, name="weights3")  # shared weights
weights4 = tf.transpose(weights1, name="weights4")  # shared weights


Stacked Autoencoders: Incremental Training

- Can simplify training by starting with a single hidden layer H1
- Then, train a second AE to mimic the output of H1
- Insert this into the first network
- Can build by using H1's output as the training set for Phase 2


Stacked Autoencoders: Incremental Training (Single TF Graph)

The previous approach requires multiple TensorFlow graphs.

Can instead train both phases in a single graph: first the left side, then the right.


Stacked Autoencoders: Visualization

[Figure: input MNIST digit vs. network output]

Weights (features selected) for five nodes from H1:

[Figure: learned weight visualizations]


Stacked Autoencoders: Semi-Supervised Learning

Can pre-train the network with unlabeled data ⇒ learn useful features, then train the "logic" of the dense layer with labeled data (TF code)
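
A minimal sketch of this idea, assuming a TF 1.x-style graph (this is not the linked TF code): pre-train an autoencoder on unlabeled data, then add a softmax layer on the encoder output and train it on the small labeled set. The layer names and sizes are illustrative.

import tensorflow as tf

X = tf.placeholder(tf.float32, shape=[None, 784])   # e.g., MNIST pixels (unlabeled + labeled)
y = tf.placeholder(tf.int64, shape=[None])          # labels, only for the small labeled set

hidden = tf.layers.dense(X, 300, activation=tf.nn.relu, name="encoder")
recon = tf.layers.dense(hidden, 784, name="decoder")
recon_loss = tf.reduce_mean(tf.square(recon - X))
pretrain_op = tf.train.AdamOptimizer(1e-3).minimize(recon_loss)   # Phase 1: unlabeled data

logits = tf.layers.dense(hidden, 10, name="classifier")
clf_loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
# Phase 2: train only the new classifier layer on labeled data (optionally fine-tune encoder)
clf_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="classifier")
train_clf_op = tf.train.AdamOptimizer(1e-3).minimize(clf_loss, var_list=clf_vars)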

Transfer Learning from Trained Classifier

Can also transfer from a classifier trained on a different task, e.g., transfer a GoogleNet architecture to ultrasound classification.

Often choose an existing one from a model zoo.

Transposed Convolutions

What if some encoder layers are convolutional? How to upsample to the original resolution?

- Can use, e.g., linear interpolation, bilinear interpolation, etc.
- Or, transposed convolutions, e.g., tf.layers.conv2d_transpose
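
As an illustration of the second option (not from the slides), the TF 1.x-style snippet below upsamples a feature map with tf.layers.conv2d_transpose; the filter counts, kernel size, and strides are assumptions.

import tensorflow as tf

# Suppose the encoder produced 7x7 feature maps with 64 channels
codes = tf.placeholder(tf.float32, shape=[None, 7, 7, 64])

# One transposed-convolution layer: stride 2 roughly doubles spatial resolution (7x7 -> 14x14)
upsampled = tf.layers.conv2d_transpose(
    codes, filters=32, kernel_size=3, strides=2, padding="same",
    activation=tf.nn.relu)

# A second layer reaches 28x28, e.g., to reconstruct an MNIST-sized image
reconstruction = tf.layers.conv2d_transpose(
    upsampled, filters=1, kernel_size=3, strides=2, padding="same",
    activation=tf.nn.sigmoid)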

Transposed Convolutions (2)

Consider this example convolution:

[Figure: example convolution of a kernel over an input]

Transposed Convolutions (3)

An alternative way of representing the kernel:

[Figure: alternative representation of the kernel]

Transposed Convolutions (4)

This representation works with matrix multiplication on the flattened input:

[Figure: convolution as a matrix multiplied by the flattened input]

Transposed Convolutions (5)

Transpose the kernel matrix and multiply by the flat 2×2 to get the flat 4×4.
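
To make the matrix view concrete, here is a hypothetical NumPy sketch (not from the slides): a 3×3 kernel applied to a 4×4 input is written as a 4×16 matrix C, so the convolution is C @ x_flat, and the transposed convolution C.T @ y_flat maps a flat 2×2 back to a flat 4×4. The kernel values are arbitrary.

import numpy as np

kernel = np.arange(1, 10, dtype=float).reshape(3, 3)   # arbitrary 3x3 kernel
in_h = in_w = 4
out_h = out_w = 2                                       # valid convolution, stride 1

# Build C: each row holds the kernel values at the input positions it covers
C = np.zeros((out_h * out_w, in_h * in_w))
for i in range(out_h):
    for j in range(out_w):
        row = i * out_w + j
        for ki in range(3):
            for kj in range(3):
                col = (i + ki) * in_w + (j + kj)
                C[row, col] = kernel[ki, kj]

x = np.random.rand(in_h * in_w)        # flattened 4x4 input
y = C @ x                              # flattened 2x2 convolution output
x_up = C.T @ y                         # transposed convolution: flat 2x2 -> flat 4x4
print(y.reshape(2, 2), x_up.reshape(4, 4))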

Denoising Autoencoders (Vincent et al., 2010)

Can train an autoencoder to learn to denoise its input by giving it a corrupted instance x̃ as input and targeting the uncorrupted instance x.

Example noise models:
- Gaussian noise: x̃ = x + z, where z ∼ N(0, σ²I)
- Masking noise: zero out some fraction ν of the components of x
- Salt-and-pepper noise: choose some fraction ν of the components of x and set each to its min or max value (equally likely)
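
The three noise models are easy to sketch in NumPy; this illustrative snippet (not from the slides) assumes inputs scaled to [0, 1], so min and max are taken as 0 and 1.

import numpy as np

def gaussian_noise(X, sigma=0.1):
    return X + np.random.normal(0.0, sigma, size=X.shape)    # x~ = x + z, z ~ N(0, sigma^2 I)

def masking_noise(X, nu=0.3):
    X_tilde = X.copy()
    X_tilde[np.random.rand(*X.shape) < nu] = 0.0             # zero out a fraction nu of components
    return X_tilde

def salt_and_pepper_noise(X, nu=0.3):
    X_tilde = X.copy()
    corrupt = np.random.rand(*X.shape) < nu
    # each corrupted component set to min (0) or max (1), equally likely
    X_tilde[corrupt] = (np.random.rand(np.sum(corrupt)) < 0.5).astype(X.dtype)
    return X_tilde

# Train with corrupted inputs and clean targets, e.g.:
# feed_dict={inputs: masking_noise(X_batch), targets: X_batch}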

Denoising Autoencoders

[Figure]

Denoising Autoencoders: Example

[Figure]

Denoising Autoencoders

How does it work?
- Even though, e.g., MNIST data are in a 784-dimensional space, they lie on a low-dimensional manifold that captures their most important features
- The corruption process moves an instance x off of the manifold
- The encoder f_θ and decoder g_θ′ are trained to project x̃ back onto the manifold

Sparse Autoencoders

An overcomplete architecture.

Regularize the outputs of the hidden layer to enforce sparsity:

J̃(x) = J(x, g(f(x))) + α Ω(h),

where J is the loss, f is the encoder, g is the decoder, h = f(x), and Ω penalizes non-sparsity of h.

- E.g., can use Ω(h) = Σ_i |h_i| and ReLU activation to force many zero outputs in the hidden layer
- Can also measure the average activation of h_i across the mini-batch and compare it to a user-specified target sparsity value p (e.g., 0.1) via square error or Kullback-Leibler divergence:

  p log(p/q) + (1 − p) log((1 − p)/(1 − q)),

  where q is the average activation of h_i over the mini-batch
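
A hedged TF 1.x-style sketch (not from the slides) of both variants; the slide pairs the L1 penalty with ReLU units, while this sketch uses sigmoid units so the KL variant's average activation q stays in (0, 1). Sizes and coefficients are illustrative.

import tensorflow as tf

X = tf.placeholder(tf.float32, shape=[None, 784])
hidden = tf.layers.dense(X, 1000, activation=tf.nn.sigmoid)    # overcomplete hidden layer
outputs = tf.layers.dense(hidden, 784)
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))   # J(x, g(f(x)))
alpha = 1e-3

# Variant 1: L1 penalty, Omega(h) = sum_i |h_i|
l1_penalty = tf.reduce_sum(tf.abs(hidden))

# Variant 2: KL divergence between target sparsity p and per-unit average activation q
p, eps = 0.1, 1e-10
q = tf.reduce_mean(hidden, axis=0)                             # average over the mini-batch
kl_penalty = tf.reduce_sum(
    p * tf.log(p / (q + eps)) + (1 - p) * tf.log((1 - p) / (1 - q + eps)))

total_loss = reconstruction_loss + alpha * kl_penalty          # or + alpha * l1_penalty
training_op = tf.train.AdamOptimizer(1e-3).minimize(total_loss)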

Contractive Autoencoders

Similar to a sparse autoencoder, but use

Ω(h) = Σ_{j=1}^{m} Σ_{i=1}^{n} (∂h_i / ∂x_j)²

- I.e., penalize large partial derivatives of the encoder outputs with respect to the input values
- This contracts the output space by mapping input points in a neighborhood near x to a smaller output neighborhood near f(x)
  ⇒ Resists perturbations of the input x
- If h has sigmoid activation, the encoding is nearly binary, and a CAE pushes the embeddings to the corners of a hypercube
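
One way to sketch this penalty in TF 1.x-style code (an assumption, not the course's implementation) is to differentiate each hidden unit with respect to the input via tf.gradients and sum the squared partials; clear, but slow for large hidden layers.

import tensorflow as tf

n_inputs, n_hidden = 784, 100
X = tf.placeholder(tf.float32, shape=[None, n_inputs])
hidden = tf.layers.dense(X, n_hidden, activation=tf.nn.sigmoid)   # h = f(x)
outputs = tf.layers.dense(hidden, n_inputs)
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))

# Omega(h): squared Frobenius norm of the Jacobian dh/dx, built one hidden unit at a time
jacobian_penalty = tf.add_n([
    tf.reduce_sum(tf.square(tf.gradients(hidden[:, i], X)[0]))
    for i in range(n_hidden)])

alpha = 1e-4
loss = reconstruction_loss + alpha * jacobian_penalty
training_op = tf.train.AdamOptimizer(1e-3).minimize(loss)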

Variational Autoencoders

A VAE is an autoencoder that is also a generative model
⇒ Can generate new instances according to a probability distribution
  - E.g., hidden Markov models, Bayesian networks
  - Contrast with discriminative models, which predict classifications

- Encoder f outputs [µ, σ]^T
- Pair (µ_i, σ_i) parameterizes a Gaussian distribution for dimension i = 1, ..., n
- Draw z_i ∼ N(µ_i, σ_i)
- Decode this latent variable z to get g(z)

Variational Autoencoders: Latent Variables

- Independence of the z dimensions makes it easy to generate instances with respect to complex distributions via the decoder g
- Latent variables can be thought of as values of attributes describing the inputs
  - E.g., for MNIST, latent variables might represent "thickness", "slant", "loop closure"

Variational Autoencoders: Architecture

[Figure: VAE architecture]

Variational Autoencoders: Optimization

Maximum likelihood (ML) approach for training generative models: find a model (θ) with maximum probability of generating the training set X.

Achieve this by minimizing the sum of:
- End-to-end AE loss (e.g., square, cross-entropy)
- Regularizer measuring the distance (K-L divergence) between the latent distribution q(z | x) and N(0, I) (the standard multivariate Gaussian)

N(0, I) is also considered the prior distribution over z (the distribution when no x is known).

eps = 1e-10
latent_loss = 0.5 * tf.reduce_sum(
    tf.square(hidden3_sigma) + tf.square(hidden3_mean)
    - 1 - tf.log(eps + tf.square(hidden3_sigma)))
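
To connect the two terms, a hedged continuation of the slide's snippet (assuming the decoder emits `logits` for a sigmoid output layer and `X` holds the inputs, both defined elsewhere):

# Assumed tensors: X (inputs in [0, 1]), logits (decoder output before sigmoid),
# plus latent_loss from the code above.
reconstruction_loss = tf.reduce_sum(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=X, logits=logits))  # end-to-end AE loss
total_loss = reconstruction_loss + latent_loss                         # minimize their sum
training_op = tf.train.AdamOptimizer(0.001).minimize(total_loss)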

Variational Autoencoders: Reparameterization Trick

Cannot backprop an error signal through random samples.

The reparameterization trick emulates z ∼ N(µ, σ) with ε ∼ N(0, 1), z = σε + µ.
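
In TF 1.x-style code this is a one-liner; a sketch assuming the slide's hidden3_mean and hidden3_sigma tensors:

# epsilon is sampled outside the learned path, so gradients flow through sigma and mu
noise = tf.random_normal(tf.shape(hidden3_sigma), dtype=tf.float32)   # epsilon ~ N(0, 1)
hidden3 = hidden3_mean + hidden3_sigma * noise                        # z = mu + sigma * epsilon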

Variational Autoencoders, Example Generated Images: Random

Draw z ∼ N(0, I) and display g(z).

[Figure: generated images]

Variational Autoencoders, Example Generated Images: Manifold

Uniformly sample points in the (2-dimensional) z space and decode.

[Figure: decoded manifold of generated images]

Variational Autoencoders: 2D

Cluster analysis by digit (2D latent space).

[Figure: latent-space scatter plot colored by digit]

Aside: Visualizing with t-SNE (van der Maaten and Hinton, 2008)

- Visualize high-dimensional data, e.g., embedded representations
- Want the low-dimensional representation to have similar neighborhoods as the high-dimensional one
- Map each high-dimensional x_1, ..., x_N to low-dimensional y_1, ..., y_N by matching pairwise distributions based on distance
  ⇒ Probability p_ij that pair (x_i, x_j) is chosen should be similar to probability q_ij that pair (y_i, y_j) is chosen
- Set p_ij = (p_{j|i} + p_{i|j}) / (2N), where

  p_{j|i} = exp(−||x_i − x_j||² / (2σ_i²)) / Σ_{k≠i} exp(−||x_i − x_k||² / (2σ_i²))

  and σ_i is chosen to control the density of the distribution
- I.e., p_{j|i} is the probability of x_i choosing x_j as its neighbor if chosen in proportion to a Gaussian density centered at x_i


Aside: Visualizing with t-SNE (2) (van der Maaten and Hinton, 2008)

Also, define q via the Student's t distribution:

q_ij = (1 + ||y_i − y_j||²)^(−1) / Σ_{k≠ℓ} (1 + ||y_k − y_ℓ||²)^(−1)

- Using Student's t instead of a Gaussian helps address the crowding problem, where distant clusters in x space squeeze together in y space
- Now choose the y values to match the distributions p and q via the Kullback-Leibler divergence:

  Σ_{i≠j} p_ij log(p_ij / q_ij)
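
In practice this is rarely implemented by hand; a sketch of typical usage (an assumption, not part of the lecture) with scikit-learn's TSNE on learned embeddings:

import numpy as np
from sklearn.manifold import TSNE

codings = np.random.randn(1000, 30)              # stand-in for embeddings from an autoencoder
tsne = TSNE(n_components=2, perplexity=30.0)     # perplexity controls the neighborhood density (sigma_i)
codings_2d = tsne.fit_transform(codings)         # shape (1000, 2), ready for a scatter plot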

Generative Adversarial Network

- GANs are also generative models, like VAEs
- Models a game between two players:
  - The generator creates samples intended to come from the training distribution
  - The discriminator attempts to discern the "real" (original training) samples from the "fake" (generated) ones
- The discriminator trains as a binary classifier; the generator trains to fool the discriminator

Generative Adversarial Network: How the Game Works

- Let D(x) be the discriminator, parameterized by θ^(D)
  - Goal: Find θ^(D) minimizing J^(D)(θ^(D), θ^(G))
- Let G(z) be the generator, parameterized by θ^(G)
  - Goal: Find θ^(G) minimizing J^(G)(θ^(D), θ^(G))
- A Nash equilibrium of this game is (θ^(D), θ^(G)) such that each θ^(i), i ∈ {D, G}, yields a local minimum of its corresponding J

Generative Adversarial Network: Training

Each training step:
- Draw a minibatch of x values from the dataset
- Draw a minibatch of z values from the prior (e.g., N(0, I))
- Simultaneously update θ^(G) to reduce J^(G) and θ^(D) to reduce J^(D), via, e.g., Adam

For J^(D), it is common to use cross-entropy where the label is 1 for real and 0 for fake.

Since the generator wants to trick the discriminator, can use J^(G) = −J^(D). Other generator losses exist that are generally better in practice, e.g., based on ML.
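
A hedged TF 1.x-style sketch of these losses (an illustration, not the lecture's code): cross-entropy for the discriminator, J^(G) = −J^(D) for the generator, and one Adam optimizer per player restricted to its own variables. The small dense generator and discriminator are stand-ins for real architectures.

import tensorflow as tf

def generator(z):
    with tf.variable_scope("gen"):
        h = tf.layers.dense(z, 128, activation=tf.nn.relu, name="g_hidden")
        return tf.layers.dense(h, 784, activation=tf.nn.sigmoid, name="g_out")

def discriminator(x, reuse=False):
    with tf.variable_scope("disc", reuse=reuse):
        h = tf.layers.dense(x, 128, activation=tf.nn.relu, name="d_hidden")
        return tf.layers.dense(h, 1, name="d_logits")     # logits: real vs. fake

X_real = tf.placeholder(tf.float32, shape=[None, 784])    # minibatch of x from the dataset
z = tf.placeholder(tf.float32, shape=[None, 100])         # minibatch of z from the prior
X_fake = generator(z)
logits_real = discriminator(X_real)
logits_fake = discriminator(X_fake, reuse=True)

# Discriminator: cross-entropy with label 1 for real and 0 for fake
d_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.ones_like(logits_real), logits=logits_real) +
    tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.zeros_like(logits_fake), logits=logits_fake))
g_loss = -d_loss                                           # J^(G) = -J^(D)

d_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="disc")
g_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="gen")
d_train = tf.train.AdamOptimizer(2e-4).minimize(d_loss, var_list=d_vars)
g_train = tf.train.AdamOptimizer(2e-4).minimize(g_loss, var_list=g_vars)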

Generative Adversarial Network: DCGAN (Radford et al., 2015)

"Deep convolutional GAN"

The generator uses transposed convolutions (e.g., tf.layers.conv2d_transpose) without pooling to upsample images for input to the discriminator.

Generative Adversarial Network, DCGAN Generated Images: Bedrooms

Trained on the LSUN dataset, sampled z space.

[Figure: generated bedroom images]

Generative Adversarial Network, DCGAN Generated Images: Adele Facial Expressions

Trained on frame grabs of an interview, sampled z space.

[Figure: generated facial expressions]

Generative Adversarial Network, DCGAN Generated Images: Latent Space Arithmetic

Performed semantic arithmetic in z space!

[Figure: latent-space arithmetic results]

(Non-center images have noise added in z space; the center is noise-free.)