
Lecture 15

A Short Introduction to Autoencoders

STAT 479: Deep Learning, Spring 2019, Sebastian Raschka, http://stat.wisc.edu/~sraschka/teaching/stat479-ss2019/


Working with datasets without considering a target variable

Some Applications and Goals:
• Finding hidden structures in data
• Clustering
• Retrieving similar objects
• Exploratory Data Analysis
• Generating new examples

Principal Component Analysis (PCA)

1) Find directions of maximum variance

[Figure: data in the (x1, x2) plane with the principal component directions PC1 and PC2 overlaid]

Principal Component Analysis (PCA)

2) Transform features onto directions of maximum variance

[Figure: the same data transformed onto the PC1/PC2 axes]

Principal Component Analysis (PCA)

3) Usually, consider only a subset of the vectors with the most variance

[Figure: the data projected onto the first principal component (PC1) only]
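The three steps above can be made concrete with a few lines of scikit-learn (a minimal sketch, not from the lecture; the toy data and parameter choices are made up):

```python
# Minimal PCA sketch: find the directions of maximum variance,
# then project onto the first component only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(123)
X = rng.multivariate_normal(mean=[0., 0.],
                            cov=[[3., 2.], [2., 2.]],
                            size=200)              # 2D toy data

pca = PCA(n_components=1)                          # keep only PC1 (most variance)
Z = pca.fit_transform(X)                           # steps 1 + 2: find directions & project
X_reconstructed = pca.inverse_transform(Z)         # map back into the original space

print(pca.explained_variance_ratio_)               # variance captured by PC1
```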

Principal Component Analysis (PCA)

Another view of PCA: minimizing the squared (orthogonal) offsets

[Figure: orthogonal offsets of the data points to PC1 in the (x1, x2) plane]

Note: in least-squares regression, we minimize the vertical offsets.

A Basic Fully-Connected (Multilayer-Perceptron) Autoencoder

[Figure: the encoder maps the inputs to the hidden units (embedded space / latent space / bottleneck); the decoder maps them back to the outputs = reconstructed inputs]

A Basic Fully-Connected (Multilayer-Perceptron) Autoencoder

$\mathcal{L}(\mathbf{x}, \mathbf{x}') = \lVert \mathbf{x} - \mathbf{x}' \rVert_2^2 = \sum_i (x_i - x_i')^2$

[Figure: encoder/decoder diagram with hidden units (embedded space / latent space / bottleneck)]

If we don't use non-linear activation functions and minimize the MSE, this is very similar to PCA. However, the latent dimensions will not necessarily be orthogonal and will have approximately the same variance.
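The claim above can be sketched in a few lines of PyTorch (a minimal illustration, not from the lecture notebooks; the 784-dimensional input, 2-dimensional latent space, and training details are assumptions):

```python
import torch
import torch.nn as nn

# Linear autoencoder: no activation functions, trained with MSE.
# Its latent space spans (approximately) the same subspace as PCA,
# but the latent dimensions are not forced to be orthogonal.
class LinearAutoencoder(nn.Module):
    def __init__(self, num_features=784, num_latent=2):
        super().__init__()
        self.encoder = nn.Linear(num_features, num_latent)
        self.decoder = nn.Linear(num_latent, num_features)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = LinearAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.MSELoss()

x = torch.randn(64, 784)           # stand-in for a mini-batch of flattened images
optimizer.zero_grad()
loss = loss_fn(model(x), x)        # reconstruction error
loss.backward()
optimizer.step()
```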

A Basic Fully-Connected (Multilayer-Perceptron) Autoencoder

Question: If we can achieve the same with PCA, which is essentially a kind of matrix factorization that is more efficient than Backprop + SGD, why bother with autoencoders?

Potential Autoencoder Applications

After training, disregard this part (the decoder)

• Use the embedding as input to classic methods (SVM, KNN, ...)
• Or, similar to transfer learning, train the autoencoder on a large image dataset, then fine-tune the encoder part on your own, smaller dataset and/or add your own output (classification) layer
• The latent space can also be used for visualization (EDA, clustering), but there are better methods for that

t-Distributed Stochastic Neighbor Embedding (t-SNE)

Another way to learn embeddings ... (t-SNE is only meant for visualization, not for preparing datasets!)

Note that MNIST has 28 x 28 = 784 dimensions

For details, see https://github.com/rasbt/stat479-machine-learning-fs18/blob/master/14_feat-extract/14_feat-extract_slides.pdf

[Figure (a): Visualization by t-SNE. Shown are 6,000 images from MNIST projected into 2D]

Maaten, L. V. D., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9(Nov), 2579-2605.


A Simple Autoencoder

Encoder: Reshape 28*28 => 784, then fully connected layer + leaky relu (784 => 32), giving a 32-dim latent space
Decoder: fully connected layer + sigmoid (32 => 784), then Reshape 784 => 28*28

https://github.com/rasbt/stat479-deep-learning-ss19/blob/master/L15_autoencoder/code/ae-simple.ipynb

[Figure: original vs. reconstructed digits]
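A minimal PyTorch sketch of this architecture (the linked notebook is the reference implementation; everything beyond the layer sizes and activations named on the slide is an assumption):

```python
import torch
import torch.nn as nn

class SimpleAutoencoder(nn.Module):
    """Fully connected autoencoder: 784 -> 32 -> 784 (for 28x28 MNIST images)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),                      # reshape 28*28 -> 784
            nn.Linear(784, 32),                # 784 => 32
            nn.LeakyReLU(0.01),
        )
        self.decoder = nn.Sequential(
            nn.Linear(32, 784),                # 32 => 784
            nn.Sigmoid(),                      # pixel values in [0, 1]
            nn.Unflatten(1, (1, 28, 28)),      # reshape 784 -> 28*28
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = SimpleAutoencoder()
x = torch.rand(16, 1, 28, 28)                  # stand-in mini-batch
loss = nn.functional.mse_loss(model(x), x)     # reconstruction loss
```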

A Convolutional Autoencoder

Encoder: 1 or more convolutional layers
Decoder: 1 or more "de"convolutional layers

https://github.com/rasbt/stat479-deep-learning-ss19/blob/master/L15_autoencoder/code/ae-conv.ipynb

[Figure: original vs. reconstructed digits]
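A minimal sketch of such a convolutional autoencoder (channel counts, kernel sizes, and strides are illustrative assumptions; see the linked notebook for the version used in class):

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: strided convolutions shrink 28x28 -> 14x14 -> 7x7
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.01),
            nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.01),
        )
        # Decoder: transposed ("de")convolutions grow 7x7 -> 14x14 -> 28x28
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(16, 8, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.LeakyReLU(0.01),
            nn.ConvTranspose2d(8, 1, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

x = torch.rand(16, 1, 28, 28)
print(ConvAutoencoder()(x).shape)   # torch.Size([16, 1, 28, 28])
```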

Transposed Convolution

• Allows us to increase the size of the output feature map compared to the input feature map

• Synonyms:
‣ often also (incorrectly) called "deconvolution" (mathematically, deconvolution is defined as the inverse of convolution, which is different from transposed convolution)
‣ the term "unconv" is sometimes also used
‣ fractionally strided convolution is another term for that

Regular Convolution:

[Figure 2.1 (Dumoulin & Visin): No padding, unit strides. Convolving a 3×3 kernel over a 4×4 input using unit strides (i.e., i = 4, k = 3, s = 1, and p = 0).]

Transposed Convolution (emulated with direct convolution):

[Figure 4.1 (Dumoulin & Visin): The transpose of convolving a 3×3 kernel over a 4×4 input using unit strides (i.e., i = 4, k = 3, s = 1, and p = 0). It is equivalent to convolving a 3×3 kernel over a 2×2 input padded with a 2×2 border of zeros using unit strides (i.e., i' = 2, k' = k, s' = 1, and p' = 2).]

Dumoulin, Vincent, and Francesco Visin. "A guide to convolution arithmetic for deep learning." arXiv preprint arXiv:1603.07285 (2016).


Transposed Convolution

output = s(n - 1) + k - 2p

(n: input size, s: stride, k: kernel size, p: padding)
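A quick sanity check of the formula with PyTorch's ConvTranspose2d (the specific values of n, k, s, and p are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

n, k, s, p = 5, 3, 2, 0            # input size, kernel size, stride, padding
tconv = nn.ConvTranspose2d(1, 1, kernel_size=k, stride=s, padding=p)

x = torch.rand(1, 1, n, n)
out = tconv(x)

print(out.shape[-1])               # 11
print(s * (n - 1) + k - 2 * p)     # 11 -> matches the formula above
```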

Transposed Convolution

strides: in transposed convolutions, we stride over the output; hence, larger strides will result in larger outputs (opposite to regular convolutions). You can think of it as this:

[Figure 4.5 (Dumoulin & Visin): The transpose of convolving a 3×3 kernel over a 5×5 input using 2×2 strides (i.e., i = 5, k = 3, s = 2, and p = 0). It is equivalent to convolving a 3×3 kernel over a 2×2 input (with 1 zero inserted between inputs) padded with a 2×2 border of zeros using unit strides (i.e., i' = 2, ĩ' = 3, k' = k, s' = 1, and p' = 2).]

Dumoulin, Vincent, and Francesco Visin. "A guide to convolution arithmetic for deep learning." arXiv preprint arXiv:1603.07285 (2016).


https://distill.pub/2016/deconv-checkerboard/

A good interactive article highlighting the dangers of transposed conv.

In short, it is recommended to replace transposed convolutions with upsampling (interpolation) followed by a regular convolution

A Convolutional Autoencoder with Nearest Neighbor Interpolation

Encoder: convolutional layers + maxpooling
Decoder: convolutional layers + interpolation

https://github.com/rasbt/stat479-deep-learning-ss19/blob/master/L15_autoencoder/code/ae-interpolate.ipynb

[Figure: original vs. reconstructed digits]
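A minimal sketch of one decoder stage in this style (nearest-neighbor upsampling followed by a regular convolution instead of a transposed convolution; channel counts and sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Decoder stage: nearest-neighbor interpolation doubles the spatial size,
# then a regular convolution refines the upsampled feature map.
upsample_block = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),   # 7x7 -> 14x14
    nn.Conv2d(16, 8, kernel_size=3, padding=1),    # keeps 14x14
    nn.LeakyReLU(0.01),
)

x = torch.rand(16, 16, 7, 7)
print(upsample_block(x).shape)   # torch.Size([16, 8, 14, 14])
```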


Transposed Convolution

padding: in transposed convolutions, we pad the output; hence, larger padding will result in smaller output maps (opposite to regular convolutions).

For a 2×2 input (previous slide), the output would be 3×3.

[Figure 4.6 (Dumoulin & Visin): The transpose of convolving a 3×3 kernel over a 5×5 input padded with a 1×1 border of zeros using 2×2 strides (i.e., i = 5, k = 3, s = 2, and p = 1). It is equivalent to convolving a 3×3 kernel over a 3×3 input (with 1 zero inserted between inputs) padded with a 1×1 border of zeros using unit strides (i.e., i' = 3, ĩ' = 5, k' = k, s' = 1, and p' = 1).]

Dumoulin, Vincent, and Francesco Visin. "A guide to convolution arithmetic for deep learning." arXiv preprint arXiv:1603.07285 (2016).

Side-note: Often you will also see binary cross entropy used as a loss instead of MSE

I find BCE for the autoencoder reconstruction loss problematic (because it is not symmetric), but it does seem to work well in practice ...
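One way to see the asymmetry concretely (my own illustration, not from the slides): for the same absolute pixel error around a non-binary target value, BCE depends on the direction of the error, while the squared error does not.

```python
import torch
import torch.nn.functional as F

target = torch.tensor([0.2])      # a "pixel" value between 0 and 1

# Same absolute reconstruction error (0.1), once below and once above the target:
under = torch.tensor([0.1])
over = torch.tensor([0.3])

print(F.mse_loss(under, target), F.mse_loss(over, target))
# MSE: 0.0100 and 0.0100 -> symmetric around the target

print(F.binary_cross_entropy(under, target),
      F.binary_cross_entropy(over, target))
# BCE: ~0.5448 and ~0.5261 -> not symmetric around the target
```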

Autoencoders and Dropout

Add dropout layers to force networks to learn redundant features

[Figure: encoder/decoder diagram with dropout layers]
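A minimal sketch of the idea (layer sizes and the dropout placement/rate are assumptions for illustration):

```python
import torch
import torch.nn as nn

# Encoder with dropout: randomly dropping hidden units during training
# forces the network to spread the information over redundant features.
encoder = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 128),
    nn.LeakyReLU(0.01),
    nn.Dropout(p=0.5),        # dropout inside the encoder
    nn.Linear(128, 32),
)
decoder = nn.Sequential(
    nn.Linear(32, 784),
    nn.Sigmoid(),
)

x = torch.rand(16, 1, 28, 28)
x_reconstructed = decoder(encoder(x))   # shape: (16, 784)
```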

Denoising Autoencoder

Add dropout after the input, or add noise to the input, to learn to denoise images

[Figure: encoder/decoder diagram with a corrupted input]

Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P. A. (2008, July). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning (pp. 1096-1103). ACM.
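A minimal sketch of the corresponding training objective (the Gaussian noise type and scale are assumptions; dropout on the input would work analogously): corrupt the input, but compute the reconstruction loss against the original, clean input.

```python
import torch
import torch.nn.functional as F

def denoising_step(model, x, noise_std=0.1):
    """One denoising-autoencoder objective:
    reconstruct the clean x from a corrupted version of it."""
    x_noisy = x + noise_std * torch.randn_like(x)   # add Gaussian noise
    x_noisy = x_noisy.clamp(0., 1.)                 # keep valid pixel range
    x_reconstructed = model(x_noisy)
    return F.mse_loss(x_reconstructed, x)           # compare against the clean input

# usage: loss = denoising_step(model, x_batch)
```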

Sparse Autoencoder

Add an L1 penalty to the loss to learn sparse feature representations:

$\sum_i \lvert \mathrm{Enc}_i(\mathbf{x}) \rvert$

[Figure: encoder/decoder diagram]

$\mathcal{L} = \lVert \mathbf{x} - \mathrm{Dec}(\mathrm{Enc}(\mathbf{x})) \rVert_2^2 + \sum_i \lvert \mathrm{Enc}_i(\mathbf{x}) \rvert$
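A minimal sketch of how this loss could be written in PyTorch (the helper name sparse_autoencoder_loss and the optional l1_weight factor are my additions; the slide's formula uses an unweighted sum):

```python
import torch
import torch.nn.functional as F

def sparse_autoencoder_loss(x, encoder, decoder, l1_weight=1.0):
    """Reconstruction loss plus an L1 penalty on the latent activations.
    (The slide's formula uses an unweighted sum; l1_weight is a common
    practical addition and defaults to 1 here.)"""
    z = encoder(x)                                        # Enc(x)
    x_reconstructed = decoder(z)                          # Dec(Enc(x))
    reconstruction = F.mse_loss(x_reconstructed, x, reduction='sum')
    sparsity = z.abs().sum()                              # sum_i |Enc_i(x)|
    return reconstruction + l1_weight * sparsity
```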
