
Unsupervised Representation Learning with Autoencoders

by

Alireza Makhzani

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering

© Copyright 2018 by Alireza Makhzani

Abstract

Unsupervised Representation Learning with Autoencoders

Alireza Makhzani
Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
2018

Despite the recent progress in deep learning, unsupervised learning still remains a largely unsolved problem. It is widely recognized that unsupervised learning algorithms that can learn useful representations are needed for solving problems with limited label information. In this thesis, we study the problem of learning unsupervised representations using autoencoders, and propose regularization techniques that enable autoencoders to learn useful representations of data in unsupervised and semi-supervised settings. First, we exploit sparsity as a generic prior on the representations of autoencoders and propose sparse autoencoders that can learn sparse representations with very fast inference processes, making them well-suited to large problem sizes where conventional sparse coding algorithms cannot be applied. Next, we study autoencoders from a probabilistic perspective and propose generative autoencoders that use a generative adversarial network (GAN) to match the distribution of the latent code of the autoencoder with a pre-defined prior. We show that these generative autoencoders can learn posterior approximations that are more expressive than the tractable densities often used in variational inference. We demonstrate the performance of the methods developed in this thesis on real-world image datasets and show their applications in generative modeling, clustering, semi-supervised classification and dimensionality reduction.

To my beloved parents and sister:

Nasrin, Hassan, and Parastesh.

Acknowledgements

Pursuing a PhD was one of the best decisions of my life. I wish to express my sincerest thanks to my advisor, Brendan Frey. I thank Brendan for his continuous support and encouragement, for believing in the work I wanted to pursue, and for being a great inspiration in my life. Brendan is a teacher and mentor with an unmatched combination of intellect, intuition and wit. I was truly blessed to have him as my advisor.

I am very grateful to all my committee members, Rich Zemel, Raquel Urtasun, David Duvenaud and Graham Taylor, for offering me their thoughtful comments and feedback.

I would like to thank all the past and present members of the PSI lab and the Machine Learning group at U of T, especially Babak Alipanahi, Andrew Delong, Christopher Srinivasa, Jimmy Ba, Hannes Bretschneider, Alice Gao, Hui Xiong, Leo Lee, Michael Leung, and Oren Kraus, for sharing ideas and collaborating with me.

During my PhD, I did two internships at Google, both of which were a wonderful learning experience for me. I would like to thank all the members of the Google Brain team and the Google DeepMind team, especially Oriol Vinyals, Jon Shlens, Navdeep Jaitly, Ian Goodfellow, Ilya Sutskever, Timothy Lillicrap, Ali Eslami, Sam Bowman and Jon Gauthier.

I would like to especially thank Alireza Moghaddamjoo and Hamid Sheikhzadeh Nadjar for inspiring me to pursue academic research and for working with me during my undergraduate years at Amirkabir University of Technology in Iran.

I had the pleasure of going through my PhD journey with many great friends. In particular, I would like to thank Sadegh Jalali, Aynaz Vatankhah, Masoud Barekatain, Amin Heidari, Weria Havary-Nassab, David Jorjani, Parisa Zareapour, Ehsan Shojaei, Siavash Fazeli and Mohammad Norouzi. I take this opportunity to especially thank Nasrin Tehrani and Hamid Emami; I felt at home in Canada thanks to their continuous support during the past years.

My deepest gratitude and love, of course, belong to my parents, Nasrin and Hassan, and my sister, Parastesh, for their unconditional love and support throughout my life. To them I owe all that I am and all that I have ever accomplished.

Contents

1 Introduction
  1.1 Overview
  1.2 Unsupervised Representation Learning
  1.3 Contributions and Outline

2 k-Sparse Autoencoders
  2.1 Introduction
  2.2 k-Sparse Autoencoders
  2.3 Analysis of k-Sparse Autoencoders
  2.4 Experiments
  2.5 Conclusion

3 Winner-Take-All Autoencoders
  3.1 Introduction
  3.2 Fully-Connected Winner-Take-All Autoencoders
  3.3 Convolutional Winner-Take-All Autoencoders
  3.4 Experiments
  3.5 Discussion
  3.6 Conclusion
  3.7 Appendix

4 Adversarial Autoencoders
  4.1 Introduction
  4.2 Adversarial Autoencoders
  4.3 Likelihood Analysis of Adversarial Autoencoders
  4.4 Supervised Adversarial Autoencoders
  4.5 Semi-Supervised Adversarial Autoencoders
  4.6 Clustering with Adversarial Autoencoders
  4.7 Dimensionality Reduction with Adversarial Autoencoders
  4.8 Conclusion
  4.9 Appendix

5 PixelGAN Autoencoders
  5.1 Introduction
  5.2 PixelGAN Autoencoders
  5.3 Experiments
  5.4 Learning Cross-Domain Relations with PixelGAN Autoencoders
  5.5 Conclusion
  5.6 Appendix

6 Conclusions
  6.1 Summary of Contributions
  6.2 Future Directions

Bibliography

Chapter 1

Introduction

1.1 Overview

The goal of the field of Artificial Intelligence is to enable computers to understand the rich world around us, and to interact with it in an intelligent manner. More recently, "deep learning" has emerged as one of the most promising approaches to achieving this goal. Deep learning is a computational approach to learning high-level concepts in data using deep, hierarchical neural networks. Emerging deep learning methods have revolutionized many real-world applications such as computer vision [1], speech processing [2], and computational biology [3, 4] by achieving breakthrough performance on different tasks. In several of these domains, deep learning is the only technology that has been able to eclipse well-established, decade-old state-of-the-art benchmarks.

Recently, supervised neural networks have been developed and used successfully to produce representations that have enabled leaps forward in classification accuracy for several tasks [1]. These networks are often trained using convolutional architectures together with regularization techniques that have been developed in the deep learning community, such as dropout [5] or maxout [6]. Despite the recent progress in supervised representation learning, the question that has remained unanswered is whether it is possible to learn equally "powerful" representations from unlabeled data, without any supervision. It is still widely recognized that unsupervised learning algorithms that can extract useful features are needed for solving problems with limited or weak label information, such as video recognition or pedestrian detection [7]. An advantage of unsupervised learning algorithms is the ability to use them in semi-supervised scenarios where the amount of labeled data is limited.


1.2 Unsupervised Representation Learning

In parallel with research on discriminative architectures, neural networks trained in an unsupervised fashion have been developed and used to produce useful representations for several tasks such as clustering and classification [8, 9, 10, 11, 12]. These approaches include stacked autoencoders, deep Boltzmann machines [8], deep belief networks [9], variational autoencoders [10] and generative adversarial networks [11]. In this section, we briefly review these approaches by broadly categorizing them into three types of learning algorithms: autoencoders, sparse models and generative models. For a more comprehensive introduction to unsupervised representation learning, we recommend the Deep Learning book [13].

1.2.1 Autoencoders

An autoencoder is a neural network that learns to reconstruct its original input with the aim of learning useful representations. The input vector x is first mapped to a hidden representation z = f(x). The function f is parametrized by the encoder network using one or more layers of non-linearity. The hidden representation z is then mapped to the output x̂ = g(z) using the decoder network, which parametrizes the decoder function g. Similar to the encoder network, the decoder network can consist of multiple layers of non-linearity. The parameters are optimized to minimize the following cost function:

L(x, g(f(x))) = \|\hat{x} - x\|_2^2 \qquad (1.1)

The cost function of Equation 1.1 is usually optimized with an additional constraint to limit the representational capacity of the autoencoder in order to prevent the autoencoder from learning the useless identity function. The constraint could be imposed on the architecture of the autoencoder (e.g., limiting the latent code dimensionality) or could be in the form of a regularization term in the final objective (e.g., sparsity constraint).
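To make this concrete, the following is a minimal sketch of an autoencoder with the reconstruction objective of Equation 1.1, written in PyTorch. The layer sizes, the ReLU non-linearities and the MNIST-sized input are illustrative assumptions rather than a prescribed architecture.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Encoder f: x -> z and decoder g: z -> x_hat, each a small MLP.
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# For a batch x of shape (batch, input_dim), the cost of Equation 1.1 is
#   x_hat, z = model(x); loss = ((x_hat - x) ** 2).sum(dim=1).mean()
```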

Undercomplete Autoencoders

One way of learning useful representations with autoencoders is to limit the capacity of the autoencoder by constraining its code size. In this case, the autoencoder is forced to extract the salient features of the data. Indeed, it can be shown that if an autoencoder uses a linear activation along with the mean squared error criterion, the resulting architecture is equivalent to the PCA algorithm (i.e., the hidden units of the autoencoder learn the principal components of the data). However, an autoencoder that uses non-linear activations has a much larger capacity and can learn more useful representations [14].

Sparse Autoencoders

Another way of limiting the representational capacity of the autoencoder to learn useful features is by sparsity regularization. As opposed to undercomplete autoencoders, where the code size is smaller than the input dimensionality, sparse autoencoders are typically overcomplete, but impose an additional sparsity constraint on the learnt representation of the data. There are many approaches to imposing sparsity on the latent code. One strategy is to add an additional term to the loss function during training to penalize the KL divergence between the hidden unit marginals and a desired sparsity rate [15]. More precisely, let us define the average activation of hidden unit j over the data distribution as

\hat{\rho}_j = \frac{1}{n} \sum_{i=1}^{n} z_j(x_i) \qquad (1.2)

where z_j is the activation of hidden unit j, and n is the number of data points in the training data. The cost function of the sparse autoencoder is

L(x, g(f(x))) + \lambda \sum_{j=1}^{m} \mathrm{KL}(\hat{\rho}_j \| \rho) \qquad (1.3)

where L(x, g(f(x))) is defined in Equation 1.1, m is the number of hidden units and ρ is the target sparsity rate. This sparsity is often referred to as lifetime sparsity, because it ensures sparse activations across the training data for any given hidden unit.

A different approach to learning sparse representations in autoencoders is to add a regularization term that penalizes the ℓ1 norm of the latent code for any given input x. In this scheme, the final objective is

L(x, g(f(x))) + \lambda \|z\|_1 \qquad (1.4)

This sparsity is often referred to as population sparsity, because it ensures sparse activations across the hidden units for any given input x. We will further discuss sparse autoencoders in the context of sparse coding in Section 1.2.2.
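As a sketch of how these two penalties can be computed in practice, the snippet below evaluates the lifetime (KL) penalty of Equation 1.3 and the population (ℓ1) penalty of Equation 1.4 for a batch of hidden activations. The assumption that the activations lie in [0, 1] (e.g., sigmoid units) and the target rate ρ = 0.05 are illustrative.

```python
import torch

def lifetime_sparsity_penalty(z, rho=0.05, eps=1e-8):
    # z: (batch, m) activations in [0, 1]. Average each hidden unit over the batch
    # to estimate rho_hat_j (Equation 1.2), then sum KL(rho_hat_j || rho) over units.
    rho_hat = z.mean(dim=0).clamp(eps, 1 - eps)
    kl = rho_hat * torch.log(rho_hat / rho) + (1 - rho_hat) * torch.log((1 - rho_hat) / (1 - rho))
    return kl.sum()

def population_sparsity_penalty(z):
    # L1 norm of the code for each input, averaged over the batch (Equation 1.4).
    return z.abs().sum(dim=1).mean()
```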

Denoising Autoencoders

Another way to regularize the autoencoders is by introducing stochasticity during training. As opposed to the standard autoencoder where the network is trained to reconstruct the original input, the denoising autoencoder [16, 17] is trained to reconstruct the input from its corrupted version. The objective of the denoising autoencoder is

L(x, g(f(\tilde{x}))) \qquad (1.5)

where \tilde{x} is a copy of x that has been corrupted with some form of noise. In the process of learning to undo the input corruption, the denoising autoencoder is encouraged to learn higher-level representations that are stable under this corruption process. These robust features have been shown to be useful for downstream tasks such as classification. It has also been shown that denoising autoencoders can be used as generative models to capture the data distribution [18, 19].

Denoising autoencoders are an example of using stochasticity to regularize neural networks. Another type of autoencoder that uses stochasticity as the regularization technique is the dropout autoencoder [5]. In these autoencoders, in addition to adding noise to the input of the autoencoder, dropout noise [5] is added to each layer of the autoencoder to achieve further regularization. Dropout autoencoders often outperform denoising autoencoders on classification tasks.
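A minimal sketch of the denoising objective of Equation 1.5: corrupt the input with masking noise and reconstruct the clean input. The 20% masking rate is an illustrative assumption, and `model` is assumed to be an autoencoder returning (x̂, z) as in the earlier sketch.

```python
import torch

def denoising_loss(model, x, corruption_rate=0.2):
    # Masking noise: randomly zero a fraction of the input dimensions, then ask the
    # autoencoder to reconstruct the uncorrupted x (Equation 1.5).
    mask = (torch.rand_like(x) > corruption_rate).float()
    x_tilde = x * mask
    x_hat, _ = model(x_tilde)
    return ((x_hat - x) ** 2).sum(dim=1).mean()
```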

Contractive Autoencoders

Another strategy to learn useful representations in autoencoders is to explicitly encourage learning features that are robust to slight variations of input values. While denoising autoencoders implicitly encourage the robustness of features to input variations, contractive autoencoders [20] achieve this robustness explicitly by penalizing the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input using an additional term in the cost function. The final objective function of the contractive autoencoder is

L(x, g(f(x))) + \lambda \sum_{i=1}^{m} \|\nabla_x z_i\|^2 \qquad (1.6)

where z_i is the activation of hidden unit i, and m is the number of hidden units.
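The penalty term of Equation 1.6 can be computed with automatic differentiation. The sketch below does so for a single 1-D input vector by looping over the code dimensions, which is simple but not the most efficient implementation; `encoder` is assumed to be any differentiable module mapping x to z.

```python
import torch

def contractive_penalty(encoder, x):
    # Squared Frobenius norm of the Jacobian dz/dx for a single 1-D input x
    # (the penalty term in Equation 1.6), computed one code dimension at a time.
    x = x.clone().requires_grad_(True)
    z = encoder(x)
    penalty = x.new_zeros(())
    for i in range(z.shape[0]):
        grad_i, = torch.autograd.grad(z[i], x, retain_graph=True, create_graph=True)
        penalty = penalty + grad_i.pow(2).sum()
    return penalty
```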

Applications of Autoencoders

Autoencoders have many applications including generative modeling, semi-supervised learning, data compression, dimensionality reduction, clustering and information retrieval. In this section, we briefly review some of these applications.

Generative Modeling. One of the applications of autoencoders is generative modeling. Recently, it has been shown that denoising autoencoders (Section 1.2.1) and variational autoencoders (Section 1.2.3) are efficient methods for capturing the data distribution. Therefore, these autoencoders can be used in applications such as image denoising, inpainting, super-resolution, exploration in reinforcement learning, and neural network pretraining.

Semi-supervised Learning. An advantage of unsupervised representation learning algorithms such as autoencoders is the ability to use them in semi-supervised scenarios where the amount of labeled data is limited. Once the autoencoder is trained, we can use its learnt unsupervised features for classification tasks by training a linear classifier (e.g., an SVM) on top of them using the labeled data. Indeed, semi-supervised classification is one of the main approaches to evaluating the quality of unsupervised features learnt by regularized autoencoders such as sparse autoencoders or denoising autoencoders. Another approach to using autoencoders for semi-supervised classification is to train generative autoencoders with discrete latent variables, such as semi-supervised variational autoencoders [21]. These generative autoencoders can disentangle the label information from other underlying factors of variation in an unsupervised fashion, which enables the posterior over the discrete latent variables to be used for semi-supervised learning.

Data Compression. Another application of autoencoders is data compression. Once an undercomplete autoencoder (Section 1.2.1) is trained, its hidden code can be used to find low-dimensional representations of the data points. One of the challenges of using autoencoders for data compression is the inherent non-differentiability of the compression loss. However, there has recently been promising progress in this direction. For example, [22] proposes an autoencoder that uses a smooth approximation to the non-differentiable quantization loss, and [23, 24] propose to use recurrent neural networks for compression. These methods achieve performance competitive with current state-of-the-art algorithms in lossy image compression such as JPEG2000.

Dimensionality Reduction and Clustering. Another application of autoencoders is to map the data onto a low-dimensional space in which the data is more interpretable. For example, [14] successfully trained a deep autoencoder with a 30-dimensional code space on documents by pre-training the weights of the autoencoder with the weights of a stacked RBM. It was shown that the autoencoder achieves better reconstruction error than PCA, and also clusters the data into more meaningful and interpretable categories.

Information Retrieval. Autoencoders can also be used in information retrieval applications to find an entry in a dataset that is closest to a query. For example, the idea of semantic hashing is proposed in [25], where a deep autoencoder with a low-dimensional and binary hidden code is trained on a set of documents. The autoencoder is pre-trained with the weights of a stacked RBM, and during the fine-tuning stage, Gaussian noise on the hidden code is used to force the network to learn a binary representation. The use of the binary code makes it possible to store the entire dataset in a hash table. For any given query, we can perform information retrieval by finding the entries of the dataset that are mapped to the same binary code as the query, or find similar entries simply by changing a few bits of the binary code. Similar semantic hashing approaches have been proposed for vision applications in [26, 27].

1.2.2 Sparse Models

One of the promising approaches to learning unsupervised representations is to exploit sparsity as a generic prior on the representations. Learning sparse representations can be motivated from a biological perspective. A major result in neural coding from Olshausen and Field [28] is that receptive fields learnt on natural images using an unsupervised learning algorithm with a sparsity prior have properties similar to primary visual cortex (V1) receptive fields. One of the advantages of learning sparse models is that they are by nature more easily compressed. When working with sparse representations, by storing only their non-zero values and the corresponding positions, we can significantly reduce the storage space, which also makes computations faster.

Sparse feature learning algorithms range from sparse coding approaches [28] to training neural networks with sparsity penalties [29, 15]. These methods typically comprise two steps: a learning algorithm that produces a dictionary W that sparsely represents the data \{x_i\}_{i=1}^N, and an encoding algorithm that, given the dictionary, defines a mapping from a new input vector x to a feature vector. A practical problem with sparse coding is that both the dictionary learning and the sparse encoding steps are computationally expensive. Dictionaries are usually learnt offline by iteratively recovering sparse codes and updating the dictionary. Sparse codes are computed using the current dictionary W and a pursuit algorithm to solve

\hat{z}_i = \underset{z}{\mathrm{argmin}} \; \|x_i - Wz\|_2^2 \quad \text{s.t.} \quad \|z\|_0 < k \qquad (1.7)

where z_i, i = 1, \dots, N are the learnt representations. Convex relaxation methods such as ℓ1 minimization or greedy methods such as OMP [30] are used to solve the above optimization. While greedy algorithms are faster, they are still slow in practice. The current sparse codes are then used to update the dictionary, using techniques such as the method of optimal directions (MOD) [31] or K-SVD [32]. These methods are computationally expensive; MOD requires inverting the data matrix at each step, and K-SVD needs to compute an SVD in order to update every column of the dictionary. To achieve speedups, in [33, 34] a parameterized non-linear encoder function is trained to explicitly predict sparse codes using a soft-thresholding operator. However, these methods assume that the dictionary is already given and do not address the offline phase.

Another approach that has been taken recently is to train autoencoders in a way that encourages sparsity. However, these methods usually involve combinations of activation functions, sampling steps and different kinds of penalties, and are sometimes not guaranteed to produce sparse representations for each input. For example, in [15, 29], a lifetime sparsity penalty function proportional to the negative of the KL divergence between the hidden unit marginals and the target sparsity probability is added to the cost function. This results in sparse activation of hidden units across training points. Another approach to achieving sparse representations is to add a population sparsity cost that penalizes the ℓ1 norm of the hidden representation. As opposed to lifetime sparsity, population sparsity guarantees sparse activations of the hidden units for each input.
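The following toy sketch illustrates the two-step structure of sparse coding on a data matrix X: a crude one-step thresholding pursuit stands in for the pursuit algorithms cited above, and a MOD-style least-squares update stands in for the dictionary update. It is meant only to show the alternation, not to reproduce OMP or K-SVD.

```python
import numpy as np

def hard_threshold(Z, k):
    # Keep the k largest-magnitude coefficients in each column of Z, zero the rest.
    out = np.zeros_like(Z)
    idx = np.argsort(-np.abs(Z), axis=0)[:k]
    np.put_along_axis(out, idx, np.take_along_axis(Z, idx, axis=0), axis=0)
    return out

def learn_dictionary(X, n_atoms, k, n_iters=50):
    # X: (d, N) data matrix. Alternate a crude one-step thresholding pursuit with a
    # MOD-style least-squares dictionary update, then renormalize the atoms.
    d, _ = X.shape
    rng = np.random.default_rng(0)
    W = rng.standard_normal((d, n_atoms))
    W /= np.linalg.norm(W, axis=0, keepdims=True)
    for _ in range(n_iters):
        Z = hard_threshold(W.T @ X, k)      # sparse coding step (cf. Equation 1.7)
        W = X @ np.linalg.pinv(Z)           # MOD update: argmin_W ||X - WZ||^2
        W /= np.linalg.norm(W, axis=0, keepdims=True) + 1e-12
    return W
```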

1.2.3 Generative Models

Building scalable generative models to capture rich distributions such as audio, images or video is one of the central challenges of machine learning. One class of generative model is the probabilistic latent variable model (LVM), which models the joint distribution p(x, z), where x is the observation and z is the associated latent variable. Latent variable models are among the most promising approaches to unsupervised representation learning. The intuition is that by training a latent variable model whose number of parameters is significantly smaller than the amount of data, the latent code of the model is forced to disentangle the underlying factors of variation and learn useful representations of the data.

Until recently, deep generative models such as Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs) and Deep Boltzmann Machines (DBMs) were trained primarily by MCMC-based algorithms [35, 36]. In these approaches, MCMC methods are used to compute the gradient of the log-likelihood, which becomes more imprecise as training progresses. This is because samples from the Markov chains are unable to mix between modes fast enough. In recent years, generative models have been developed that can be trained via direct back-propagation and avoid the difficulties that come with MCMC training. For example, variational autoencoders (VAE) [10, 37] and importance weighted autoencoders [38] use a recognition network to predict the posterior distribution over the latent variables, while generative adversarial networks (GAN) [11] use an adversarial training procedure to directly shape the output distribution of the network via back-propagation. It has recently been shown in [39] that VAEs and GANs can be viewed as extensions of the wake-sleep algorithm [40, 41], which was originally proposed to train sigmoid belief networks. In this section, we briefly review these recently proposed generative models.

Generative Adversarial Networks

The Generative Adversarial Networks (GAN) [11] framework establishes a min-max adversarial game between two neural networks: a generative model, G, and a discriminative model, D. The discriminator, D(x), is a neural network that computes the probability that a point x in data space is a sample from the data distribution (a positive sample) that we are trying to model, rather than a sample from our generative model (a negative sample). Concurrently, the generator uses a function G(z) that maps samples z from the prior p(z) to the data space. G(z) is trained to maximally confuse the discriminator into believing that the samples it generates come from the data distribution. The generator is trained by leveraging the gradient of D(x) w.r.t. x, and using it to modify the generator's parameters. The solution to this game can be expressed as follows [11]:

\min_G \max_D \; \mathbb{E}_{x \sim p_d}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))] \qquad (1.8)

The generator G and the discriminator D can be found using alternating SGD in two stages: (a) train the discriminator to distinguish the true samples from the fake samples generated by the generator; (b) train the generator so as to fool the discriminator with its generated samples.

GANs can be considered within the wider framework of implicit generative models [42, 43, 44]. Implicit distributions can be sampled through their generative path, but their likelihood function is not tractable. Recently, several papers have proposed another application of GAN-style algorithms: approximate inference [42, 43, 44, 45, 46, 47, 48, 49]. These algorithms use implicit distributions to learn posterior approximations that are more expressive than the distributions with tractable densities often used in variational inference. For example, adversarial autoencoders [46] use a universal approximator posterior as the implicit posterior distribution and use adversarial training to match the aggregated posterior of the latent code to the prior distribution. Adversarial variational Bayes [43, 47] uses a more general amortized GAN inference framework within a maximum-likelihood learning setting. Another type of GAN inference technique is used in the ALI [48] and BiGAN [49] models, which have been shown to approximate maximum likelihood learning [43]. In these models, both the recognition and generative models are implicit and are jointly learnt by an adversarial training process.
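The alternating two-stage procedure described above can be sketched as follows; the binary cross-entropy losses, the assumption that D outputs a single logit per example, and the optimizer handling are illustrative choices rather than the only way to train a GAN.

```python
import torch
import torch.nn as nn

def gan_step(G, D, x_real, z_dim, opt_g, opt_d):
    # One round of the alternating two-stage procedure. D is assumed to output an
    # unnormalized logit per example; G maps a z_dim-dimensional prior sample to
    # data space.
    bce = nn.BCEWithLogitsLoss()
    n = x_real.shape[0]
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # (a) Discriminator step: real samples are labelled 1, generated samples 0.
    z = torch.randn(n, z_dim)
    x_fake = G(z).detach()                 # do not backprop into G here
    d_loss = bce(D(x_real), ones) + bce(D(x_fake), zeros)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # (b) Generator step: try to make D assign the "real" label to G's samples.
    z = torch.randn(n, z_dim)
    g_loss = bce(D(G(z)), ones)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```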

Generative Moment Matching Networks

Generative moment matching networks (GMMN) [50] are deep generative models that use the maximum mean discrepancy (MMD) objective on the mini-batches of the data to shape the distribution of the output layer of a neural network. The MMD objective can be interpreted as minimizing the distance between all moments of the model distribution and the data distribution. It has been shown that GMMNs can be combined with pre-trained dropout autoencoders to achieve better likelihood results.

Variational Autoencoders

Variational autoencoders (VAE) [10, 37] are probabilistic autoencoders that use a stochastic encoder network to model the posterior distribution q(z|x), and pair it with a top-down generative network that models the conditional log-likelihood log p(x|z). Both networks are jointly trained to maximize the following variational lower bound on the data log-likelihood:

\log p(x) \geq \mathbb{E}_{q(z|x)}[\log p(x|z)] - \mathrm{KL}(q(z|x) \| p(z)) \qquad (1.9)

where KL is the Kullback–Leibler divergence between two distributions. The objective function of Equation 1.9 can be viewed as adding the regularization term KL(q(z|x)‖p(z)) to the standard reconstruction term of the autoencoder. The prior p(z) on the latent variable is usually set to be the isotropic multivariate Gaussian p(z) = N(0, I), and the posterior distribution q(z|x) is usually set to be a Gaussian distribution whose mean and variance are predicted by the encoder network. However, [51, 52] have recently proposed efficient techniques for learning more expressive posterior distributions in the VAE framework.
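The following is a sketch of the negative of the bound in Equation 1.9 for a Gaussian posterior and an N(0, I) prior, using the reparameterization trick. The assumption that `encoder` returns (mu, logvar) and that the decoder outputs Bernoulli logits (suited to binarized images) is illustrative.

```python
import torch
import torch.nn.functional as F

def vae_loss(encoder, decoder, x):
    # encoder(x) is assumed to return the mean and log-variance of the Gaussian
    # posterior q(z|x); sample z with the reparameterization trick.
    mu, logvar = encoder(x)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    logits = decoder(z)

    # Negative of the bound in Equation 1.9: a Bernoulli reconstruction term plus
    # the closed-form KL between the Gaussian posterior and the N(0, I) prior.
    rec = F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(dim=1)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)
    return (rec + kl).mean()
```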

Autoregressive Models

A different framework for learning image statistics is autoregressive neural networks, such as Neural Autoregressive Distribution Estimation (NADE) [53], the Masked Autoencoder for Distribution Estimation (MADE) [54], PixelRNN [55] and PixelCNN [56]. In these autoregressive neural estimators, the joint distribution of pixels is cast as the product of conditional distributions, and a sequence model is used to predict each pixel given all the previously generated pixels. One of the earliest autoregressive neural networks is the fully visible belief network (FVBN) [41], which uses log-linear logistic regressors to model the conditional densities. The NADE model [53] is similar to the FVBN, but uses a deep neural network rather than shallow logistic regressors for modeling the conditional distributions of pixels. MADE [54] models the conditional probabilities using an autoencoder with masked weights, where masking is done in a way that respects the autoregressive decomposition of the joint distribution of the pixels. PixelRNN [55] uses a recurrent neural network as the sequence model to predict each pixel given all the previously generated pixels. The recently proposed PixelCNN [56] models the conditional densities using a masked convolutional autoencoder, in which the convolutional filters are masked in a way that respects the autoregressive decomposition of the joint probability of the pixels.

Autoregressive neural estimators capture image statistics directly at the pixel level without learning a hierarchical latent representation. So, unlike latent variable models, these architectures do not have an intractable inference step; however, the lack of latent variables makes it less straightforward to use them in downstream tasks such as classification. Another drawback of autoregressive architectures is that while they are good at capturing low-level pixel statistics and local structure, their generated samples often lack global structure. This is mainly because currently available sequence modeling techniques are not very effective at capturing long-range dependencies in sequential data. Scaling autoregressive generative models to large-scale images is an active area of research.
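As an illustration of the masking idea used by MADE and PixelCNN, the sketch below builds a binary mask for a 2-D convolution filter that respects a raster-scan autoregressive ordering. The type-A/type-B convention and the kernel handling are simplified assumptions, not the full PixelCNN implementation.

```python
import numpy as np

def causal_conv_mask(kernel_size, mask_type="A"):
    # Binary mask for a square 2-D convolution filter under a raster-scan ordering:
    # a pixel may depend on pixels in the rows above it and on pixels to its left
    # in the same row. A type "A" mask also hides the centre pixel (first layer);
    # a type "B" mask keeps it (subsequent layers).
    k = kernel_size
    mask = np.ones((k, k), dtype=np.float32)
    centre = k // 2
    start = centre + (1 if mask_type == "B" else 0)
    mask[centre, start:] = 0.0   # same row: hide the centre (type A) and everything to its right
    mask[centre + 1:, :] = 0.0   # hide all rows below
    return mask

# Example: multiply a convolution kernel elementwise by causal_conv_mask(5, "A")
# before applying it, so the convolution respects the autoregressive decomposition.
```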

1.3 Contributions and Outline

In this thesis, we consider the problem of unsupervised representation learning with autoencoders, and propose different regularization techniques that enable autoencoders to learn useful representations of data in an unsupervised fashion. First, we exploit sparsity as a generic prior on the representations of the autoencoder and propose k-sparse autoencoders in Chapter 2 and winner-take-all autoencoders in Chapter 3. These sparse autoencoders can learn sparse representations of the data using fast inference networks, making them well-suited to large problem sizes where conventional sparse coding algorithms cannot be applied. Next, we study autoencoders from a probabilistic perspective and propose new regularization techniques based on using a generative adversarial network (GAN) that imposes a prior distribution on the latent code of the autoencoder. In particular, we propose two generative autoencoders: adversarial autoencoders in Chapter 4 and PixelGAN autoencoders in Chapter 5. We show that these generative autoencoders can learn expressive posterior approximations, which results in learning useful unsupervised representations of the data. Here, we outline the contributions of this thesis in more detail.

In Chapter 2, we propose the k-sparse autoencoder, which is an autoencoder with a linear activation function, where in the hidden layers only the k highest activities are kept. When applied to the MNIST and NORB datasets, we find that this method achieves better classification results than denoising autoencoders, networks trained with dropout, and RBMs. k-sparse autoencoders are simple to train and the encoding stage is very fast, making them well-suited to large problem sizes where conventional sparse coding algorithms cannot be applied.

In Chapter 3, we propose a winner-take-all method for learning hierarchical sparse representations in an unsupervised fashion. We first introduce fully-connected winner-take-all autoencoders, which use mini-batch statistics to directly enforce a lifetime sparsity in the activations of the hidden units. We then propose the convolutional winner-take-all autoencoder, which combines the benefits of convolutional architectures and autoencoders for learning shift-invariant sparse representations. We describe a way to train convolutional autoencoders layer by layer, where, in addition to lifetime sparsity, a spatial sparsity within each feature map is achieved using winner-take-all activation functions. We show that winner-take-all autoencoders can be used to learn deep sparse representations from the MNIST, CIFAR-10, ImageNet, Street View House Numbers and Toronto Face datasets, and achieve competitive classification performance.

In Chapter 4, we propose the adversarial autoencoder (AAE), which is a probabilistic autoencoder that uses the recently proposed generative adversarial networks (GAN) to perform variational inference by matching the aggregated posterior of the hidden code vector of the autoencoder with an arbitrary prior distribution. Matching the aggregated posterior to the prior ensures that generating from any part of prior space results in meaningful samples. As a result, the decoder of the adversarial autoencoder learns a deep generative model that maps the imposed prior to the data distribution. We show how the adversarial autoencoder can be used in applications such as semi-supervised classification, disentangling style and content of images, unsupervised clustering, dimensionality reduction and data visualization. We perform experiments on the MNIST, Street View House Numbers and Toronto Face datasets and show that adversarial autoencoders achieve competitive results in generative modeling and semi-supervised classification tasks.

In Chapter 5, we describe the PixelGAN autoencoder, a generative autoencoder in which the generative path is a convolutional autoregressive neural network on pixels (PixelCNN) that is conditioned on a latent code, and the recognition path uses a generative adversarial network (GAN) to impose a prior distribution on the latent code. We show that different priors result in different decompositions of information between the latent code and the autoregressive decoder. For example, by imposing a Gaussian distribution as the prior, we can achieve a global vs. local decomposition, or by imposing a categorical distribution as the prior, we can disentangle the style and content information of images in an unsupervised fashion. We further show how the PixelGAN autoencoder with a categorical prior can be directly used in semi-supervised settings and achieve competitive semi-supervised classification results on the MNIST, SVHN and NORB datasets. Finally, in Chapter 6, we end this work with concluding remarks and future directions.

Relationship to Published Papers

The chapters in this thesis describe work that has been published in the following papers:

Chapter 2: Alireza Makhzani and Brendan Frey. "k-sparse autoencoders", International Conference on Learning Representations (ICLR), 2014 [57].

Chapter 3: Alireza Makhzani and Brendan Frey. "Winner-take-all autoencoders", Advances in Neural Information Processing Systems (NIPS), 2015 [58].

Chapter 4: Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. "Adversarial autoencoders", International Conference on Learning Representations (ICLR) Workshop, 2016 [46].

Chapter 5: Alireza Makhzani and Brendan Frey. "PixelGAN autoencoders", Advances in Neural Information Processing Systems (NIPS), 2017 [59].

Chapter 2

k-Sparse Autoencoders

2.1 Introduction

Recently, it has been observed that when representations are learnt in a way that encourages sparsity, improved performance is obtained on classification tasks. These methods involve combinations of activation functions, sampling steps and different kinds of penalties. To investigate the effectiveness of sparsity by itself, we propose the "k-sparse autoencoder", which is an autoencoder with a linear activation function, where in the hidden layers only the k highest activities are kept. We explore how different sparsity levels (k) impact representations and classification performance. We show that by solely relying on sparsity as the regularizer and as the only nonlinearity, we can achieve much better results than the other methods, including RBMs, denoising autoencoders [17] and dropout [5]. We demonstrate that k-sparse autoencoders are suitable for pretraining and achieve results comparable to the state of the art on the MNIST and NORB datasets.

In this chapter, Γ is an estimated support set and Γ^c is its complement, W^† is the pseudo-inverse of W, and supp_k(x) is an operator that returns the indices of the k largest coefficients of its input vector. z_Γ is the vector obtained by restricting the elements of z to the indices of Γ, and W_Γ is the matrix obtained by restricting the columns of W to the indices of Γ.

2.2 k-Sparse Autoencoders

2.2.1 The Basic Autoencoder

A shallow autoencoder maps an input vector x to a hidden representation using the function z = f(Px + b), parameterized by {P, b}, where f is the activation function, e.g., linear, sigmoidal or ReLU. The hidden representation is then mapped linearly to the output using x̂ = Wz + b'. The parameters are optimized to minimize the mean squared error \|\hat{x} - x\|_2^2 over all training points. Often, tied weights are used, so that P = W^⊤.

k-Sparse Autoencoders

Training:
1) Perform the feedforward phase and compute z = W^⊤x + b.
2) Find the k largest activations of z and set the rest to zero: z_{Γ^c} = 0, where Γ = supp_k(z).
3) Compute the output and the error using the sparsified z: x̂ = Wz + b', E = \|x - \hat{x}\|_2^2.
4) Backpropagate the error through the k largest activations defined by Γ and iterate.

Sparse Encoding:
1) Compute the features z = W^⊤x + b.
2) Find the αk largest activations of z and set the rest to zero: z_{Γ^c} = 0, where Γ = supp_{αk}(z).

2.2.2 The k-Sparse Autoencoder

The k-sparse autoencoder is based on an autoencoder with linear activation functions and tied weights. In the feedforward phase, after computing the hidden code z = W^⊤x + b, rather than reconstructing the input from all of the hidden units, we identify the k largest hidden units and set the others to zero. This can be done by sorting the activities or by using ReLU hidden units with thresholds that are adaptively adjusted until the k largest activities are identified. This results in a vector of activities with the support set supp_k(W^⊤x + b). Note that once the k largest activities are selected, the function computed by the network is linear. So the only non-linearity comes from the selection of the k largest activities. This selection step acts as a regularizer that prevents the use of an overly large number of hidden units when reconstructing the input.

Once the weights are trained, the resulting sparse representations may be used for downstream classification tasks. However, it has been observed that better results are often obtained when the sparse encoding stage used for classification does not exactly match the encoding used for dictionary training [60]. For example, while in k-means it is natural to have a hard assignment of the points to the nearest cluster in the encoding stage, it has been shown in [61] that soft assignments result in better classification performance. Similarly, for the k-sparse autoencoder, instead of using the k largest elements of W^⊤x + b as the features, we have observed that slightly better performance is obtained by using the αk largest hidden units, where α ≥ 1 is selected using validation data. So at test time, we use the support set defined by supp_{αk}(W^⊤x + b). The algorithm is summarized in the box above. A related method to the k-sparse autoencoder is the Cardinality RBM [62], which achieves exact sparsity in RBMs using a dynamic programming algorithm.
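The training procedure in the box of Section 2.2.1 can be sketched as follows in PyTorch; torch.topk performs the support selection, the tied-weight parameterization (P = W^⊤) follows Section 2.2.1, and the layer sizes and initialization scale are illustrative assumptions.

```python
import torch
import torch.nn as nn

class KSparseAutoencoder(nn.Module):
    def __init__(self, input_dim=784, n_hidden=1000, k=25):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(input_dim, n_hidden))  # tied weights
        self.b = nn.Parameter(torch.zeros(n_hidden))        # encoder bias
        self.b_out = nn.Parameter(torch.zeros(input_dim))   # decoder bias
        self.k = k

    def forward(self, x, k=None):
        k = k if k is not None else self.k       # pass alpha*k at test time if desired
        z = x @ self.W + self.b                  # feedforward phase: z = W^T x + b
        topk = torch.topk(z, k, dim=1)           # support selection (step 2)
        z_sparse = torch.zeros_like(z).scatter_(1, topk.indices, topk.values)
        x_hat = z_sparse @ self.W.t() + self.b_out   # reconstruct from the sparsified code
        return x_hat, z_sparse

# Training step: loss = ((x_hat - x) ** 2).sum(dim=1).mean(); gradients only flow
# through the k selected units, because the others were zeroed before decoding.
```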

2.3 Analysis of k-Sparse Autoencoders

In this section, we explain how the k-sparse autoencoder can be viewed in the context of sparse coding with incoherent matrices. This perspective sheds light on why k-sparse autoencoders work and why they achieve invariant features and consequently good classification results. We first explain a sparse recovery algorithm and then show that the k-sparse autoencoder iterates between an approximation of this algorithm and a dictionary update stage.

2.3.1 Iterative Thresholding with Inversion (ITI)

Iterative hard thresholding [63] is a class of low-complexity algorithms that has recently been proposed for the reconstruction of sparse signals. In this work, we use a variant called "iterative thresholding with inversion" (ITI) [64]. Given a fixed x and W, starting from z^0 = 0, ITI iteratively finds the sparsest solution of x = Wz using the following steps:

1. Support Estimation Step:

\Gamma = \mathrm{supp}_k\big(z^n + W^\top(x - Wz^n)\big) \qquad (2.1)

2. Inversion Step:

z^{n+1}_\Gamma = W^\dagger_\Gamma x = (W^\top_\Gamma W_\Gamma)^{-1} W^\top_\Gamma x, \qquad z^{n+1}_{\Gamma^c} = 0 \qquad (2.2)

Assume H = W^⊤W − I and z_0 is the true sparse solution. The first step of ITI estimates the support set as Γ = supp_k(W^⊤x) = supp_k(z_0 + Hz_0). If W were orthogonal, we would have Hz_0 = 0 and the algorithm would succeed in the first iteration. But if W is overcomplete, Hz_0 behaves as a noise vector whose variance decreases after each iteration. After estimating the support set of z as Γ, we restrict W to the indices included in Γ and form W_Γ. We then use the pseudo-inverse of W_Γ to estimate the non-zero values by minimizing \|x - W_\Gamma z_\Gamma\|_2^2. Lastly, we refine the support estimate and repeat the whole process until convergence.
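A sketch of the ITI iteration in NumPy: estimate the support, then solve the least-squares problem restricted to that support via the pseudo-inverse. The fixed iteration count is an illustrative assumption.

```python
import numpy as np

def iti(x, W, k, n_iters=10):
    # Iterative thresholding with inversion: find a k-sparse z such that x ≈ W z.
    d, m = W.shape
    z = np.zeros(m)
    for _ in range(n_iters):
        # Support estimation step (Equation 2.1).
        r = z + W.T @ (x - W @ z)
        support = np.argsort(-np.abs(r))[:k]
        # Inversion step (Equation 2.2): least squares restricted to the support.
        z = np.zeros(m)
        z[support] = np.linalg.pinv(W[:, support]) @ x
    return z
```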

2.3.2 Sparse Coding with the k-Sparse Autoencoder

Here, we show that we can derive the k-sparse autoencoder training algorithm by approximating a sparse coding algorithm that uses the ITI algorithm jointly with a dictionary update stage. The conventional approach to sparse coding is to fix the sparse code matrix Z while updating the dictionary. However, here, after estimating the support set in the first step of the ITI algorithm, we jointly perform the inversion step of ITI and the dictionary update step, while fixing just the support set of the sparse code Z. In other words, we update the atoms of the dictionary and allow the corresponding non-zero values to change at the same time, so as to minimize \|X - W_\Gamma Z_\Gamma\|_2^2 over both W_Γ and Z_Γ.

When we are performing sparse recovery with the ITI algorithm using a fixed dictionary, we should perform a fixed number of iterations to get a perfect reconstruction of the signal. But in sparse coding, since we have learnt a dictionary that is adapted to the signals, as shown in Section 2.3.3, we can find the support set with just the first iteration of ITI:

\Gamma = \mathrm{supp}_k(W^\top x) \qquad (2.3)

In the inversion step of the ITI algorithm, once we have estimated the support set, we use the pseudo-inverse of W_Γ to find the non-zero values on the support set. The pseudo-inverse of the matrix W_Γ is a matrix P_Γ that minimizes the following cost function:

W^\dagger_\Gamma = \underset{P_\Gamma}{\arg\min} \; \|x - W_\Gamma z_\Gamma\|_2^2 = \underset{P_\Gamma}{\arg\min} \; \|x - W_\Gamma P_\Gamma x\|_2^2 \qquad (2.4)

Finding the exact pseudo-inverse of W_Γ is computationally expensive, so instead we perform a single step of gradient descent. The gradient with respect to P_Γ is found as follows:

\frac{\partial \|x - W_\Gamma z_\Gamma\|_2^2}{\partial P_\Gamma} = \frac{\partial \|x - W_\Gamma z_\Gamma\|_2^2}{\partial z_\Gamma} \, x^\top \qquad (2.5)

The first term on the right hand side of Equation 2.5 corresponds to the dictionary update stage, which is computed as follows:

\frac{\partial \|x - W_\Gamma z_\Gamma\|_2^2}{\partial W_\Gamma} = (W_\Gamma z_\Gamma - x) \, z_\Gamma^\top \qquad (2.6)

Therefore, in order to approximate the pseudo-inverse, we first find the dictionary derivative and then "backpropagate" it to find the update of the pseudo-inverse. We can view these operations in the context of an autoencoder with linear activations, where P is the encoder weight matrix and W is the decoder weight matrix. At each iteration, instead of back-propagating through all the hidden units, we just back-propagate through the units with the k largest activities, defined by supp_k(W^⊤x), which is the first iteration of ITI. Keeping the k largest hidden activities and ignoring the others is the same

as forming W_Γ by restricting W to the estimated support set. Back-propagation on the decoder weights is the same as gradient descent on the dictionary, and back-propagation on the encoder weights is the same as approximating the pseudo-inverse of the corresponding W_Γ.

We can perform support estimation in the feedforward phase by assuming P = W^⊤ (i.e., the autoencoder has tied weights). In this case, support estimation can be done by computing z = W^⊤x + b and picking the k largest activations; the bias just accounts for the mean and subtracts its contribution. Then the "inversion" and "dictionary update" steps are done at the same time by back-propagation through just the units with the k largest activities. In summary, we can view k-sparse autoencoders as an approximation of a sparse coding algorithm that uses ITI in the sparse recovery stage.

2.3.3 Importance of Incoherence

The coherence of a dictionary indicates the degree of similarity between different atoms or different collections of atoms. Since the dictionary is overcomplete, we can represent each column of the dictionary as a linear combination of the other columns. What incoherence implies is that we should not be able to represent a column as a sparse linear combination of the other columns: the coefficients of the linear combination should be dense. For example, if two columns are exactly the same, then the dictionary is highly coherent, since we can represent one of those columns as a sparse linear combination of the rest of the columns. A naive measure of coherence that has been proposed in the literature is the mutual coherence µ(W), which is defined as the maximum absolute inner product across all possible pairs of atoms of the dictionary:

\mu(W) = \max_{i \neq j} |\langle w_i, w_j \rangle| \qquad (2.7)

There is a close relationship between the coherence of the dictionary and the uniqueness of the sparse solution of x = Wz. In [65], it has been proven that if k < (1 + µ⁻¹)/2, then the sparsest solution is unique. We can show that if the dictionary is incoherent enough, there is an attraction ball around the signal x, and only one sparse linear combination of the columns can get into this attraction ball. So even if we perturb the input with a small amount of noise, translation, rotation, etc., we can still achieve perfect reconstruction of the original signal, and the sparse features are always roughly conserved. Therefore, incoherence of the dictionary is a measure of the invariance and stability of the features. This is related to the denoising autoencoder [17], in which we achieve invariant features by trying to reconstruct the original signal from its noisy versions. Here we show that if the dictionary is incoherent enough, the first step of the ITI algorithm is sufficient for perfect sparse recovery.

Theorem 3.1. Assume x = Wz and that the columns of the dictionary have unit ℓ2-norm. Also, without loss of generality, assume that the non-zero elements of z are its first k elements and are sorted as z_1 ≥ z_2 ≥ ... ≥ z_k. Then, if kµ ≤ z_k / (2z_1), we can recover the support set of z using supp_k(W^⊤x).

Proof: Let us assume 1 ≤ i ≤ k and y = W^⊤x. Then, we can write:

y_i = z_i + \sum_{j=1, j \neq i}^{k} \langle w_i, w_j \rangle z_j \geq z_i - \mu \sum_{j=1, j \neq i}^{k} z_j \geq z_k - k\mu z_1 \qquad (2.8)

On the other hand:

\max_{i > k} \{y_i\} = \max_{i > k} \left\{ \sum_{j=1}^{k} \langle w_i, w_j \rangle z_j \right\} \leq k\mu z_1 \qquad (2.9)

So if kµ ≤ z_k / (2z_1), all of the first k elements of y are guaranteed to be greater than the rest of its elements.

As we can see from Theorem 3.1, the chance of finding the true support set with the encoder part of the k-sparse autoencoder depends on the incoherence of the learnt dictionary. As the k-sparse autoencoder converges (i.e., the reconstruction error goes to zero), the algorithm learns a dictionary that satisfies x ≈ Wz, so the support set of z can be estimated using the first step of ITI. Since supp_k(W^⊤x) succeeds in finding the support set when the algorithm converges, the learnt dictionary must be sufficiently incoherent.
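For reference, the mutual coherence of Equation 2.7 can be computed directly from the Gram matrix of the column-normalized dictionary; the following is a small sketch.

```python
import numpy as np

def mutual_coherence(W):
    # Maximum absolute inner product between distinct (unit-normalized) columns of W.
    Wn = W / (np.linalg.norm(W, axis=0, keepdims=True) + 1e-12)
    G = np.abs(Wn.T @ Wn)
    np.fill_diagonal(G, 0.0)   # ignore self inner products
    return G.max()
```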

2.4 Experiments

In this section, we evaluate the performance of k-sparse autoencoders in both unsupervised learning and in shallow and deep discriminative learning tasks.

2.4.1 Datasets

We use the MNIST handwritten digit dataset, which consists of 60,000 training images and 10,000 test images. We randomly separate the training set into 50,000 training cases and 10,000 cases for validation.

We also use the small NORB normalized-uniform dataset [66], which contains 24,300 training examples and 24,300 test examples. This database contains images of 50 toys from 5 generic categories: four-legged animals, human figures, airplanes, trucks, and cars. Each image consists of two channels, each of size 96 × 96 pixels. We take the inner 64 × 64 pixels of each channel and resize it using bicubic interpolation to 32 × 32 pixels, from which we form an input vector with 2048 dimensions. To normalize the contrast, the mean is subtracted from each input dimension and each dimension is divided by its standard deviation across the whole training set. The training set is separated into 20,000 cases for training and 4,300 for validation.

We also test our method on natural image patches extracted from the CIFAR-10 dataset. We randomly extract 1,000,000 patches of size 8 × 8 from the 50,000 32 × 32 images of CIFAR-10. Each patch is then locally contrast-normalized and ZCA whitened. This preprocessing pipeline is the same as the one used in [67] for feature extraction.

2.4.2 Training of k-Sparse Autoencoders

Scheduling of the Sparsity Level

When we enforce low sparsity levels in k-sparse autoencoders (e.g., k = 15 on MNIST), one issue that might arise is that in the first few epochs, the algorithm greedily assigns individual hidden units to groups of training cases, in a manner similar to k-means clustering. In subsequent epochs, these hidden units will be picked and reinforced, and the other hidden units will not be adjusted. That is, too much sparsity can prevent gradient back-propagation from adjusting the weights of these other 'dead' hidden units. We can address this problem by scheduling the sparsity level over epochs as follows.

Suppose we are aiming for a sparsity level of k = 15. Then, we start off with a large sparsity level (e.g. k = 100) for which the k-sparse autoencoder can train all the hidden units. We then linearly decrease the sparsity level from k = 100 to k = 15 over the first half of the epochs. This initializes the autoencoder in a good regime, for which all of the hidden units have a significant chance of being picked. Then, we keep k = 15 for the second half of the epochs. With this scheduling, we can train all of the filters, even for low sparsity levels.
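A sketch of this schedule, using the MNIST figures quoted above (start at k = 100, anneal linearly to k = 15 over the first half of training, then hold):

```python
def sparsity_schedule(epoch, n_epochs, k_start=100, k_target=15):
    # Linearly anneal k from k_start to k_target over the first half of training,
    # then hold it fixed; the defaults follow the MNIST example above.
    half = max(n_epochs // 2, 1)
    if epoch >= half:
        return k_target
    return round(k_start + (k_target - k_start) * epoch / half)
```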

Training Hyper-parameters

We optimized the model parameters using stochastic gradient descent with momentum as follows:

v_{k+1} = m_k v_k - \eta_k \nabla f(x_k), \qquad x_{k+1} = x_k + v_{k+1} \qquad (2.10)

Here, v_k is the velocity vector, m_k is the momentum and η_k is the learning rate at the k-th iteration. We also use a Gaussian distribution with standard deviation σ to initialize the weights. We use different momentum values, learning rates and initializations based on the task and the dataset, and validation data is used to select the hyperparameters. In the unsupervised MNIST task, the values were σ = 0.01, m_k = 0.9 and η_k = 0.01, for 5000 epochs. In the supervised MNIST task, training started with m_k = 0.25 and η_k = 1, and then the learning rate was linearly decreased to 0.001 over 200 epochs. In the unsupervised NORB task, the values were σ = 0.01, m_k = 0.9 and η_k = 0.0001, for 5000 epochs. In the supervised NORB task, training started with m_k = 0.9 and η_k = 0.01, and then the learning rate was linearly decreased to 0.001 over 200 epochs.

Implementations

While most conventional sparse coding algorithms require complex matrix operations such as matrix inversion or SVD decomposition, k-sparse autoencoders only need matrix multiplications and sorting operations in both the dictionary learning stage and the sparse encoding stage. (For a parallel, distributed implementation, the sorting operation can be replaced by a method that recursively applies a threshold until k values remain.) We used an efficient GPU implementation obtained using the publicly available gnumpy library [68] on a single Nvidia GTX 680 GPU.

2.4.3 Effect of Sparsity Level

In k-sparse autoencoders, we are able to tune the value of k to obtain the desired sparsity level, which makes the algorithm suitable for a wide variety of datasets. For example, one application could be pre-training a shallow or deep discriminative neural network. For large values of k (e.g., k = 100 on MNIST), the algorithm tends to learn very local features, as shown in Figure 2.1a and Figure 2.2a. These features are too primitive to be used for classification with a shallow architecture, since a naive linear classifier does not have enough capacity to combine them and achieve a good classification rate. However, these features could be used for pre-training deep neural nets.

As we decrease the sparsity level (e.g., k = 40 on MNIST), the output is reconstructed using a smaller number of hidden units and thus the features tend to be more global, as can be seen in Figure 2.1b, Figure 2.1c and Figure 2.2b. For example, on the MNIST dataset, the lengths of the strokes increase when the sparsity level is decreased. These less local features are suitable for classification with a shallow architecture. Nevertheless, forcing too much sparsity (e.g., k = 10 on MNIST) results in features that are too global and do not factor the input into parts, as depicted in Figure 2.1d and Figure 2.2c.

Figure 2.3 shows the filters of the k-sparse autoencoder with 1000 hidden units and sparsity level k = 50, learnt from random image patches extracted from the CIFAR-10 dataset. We can see that the k-sparse autoencoder has learnt localized Gabor filters from natural image patches.

Figure 2.4 plots histograms of the hidden unit activities for various unsupervised learning algorithms, including the k-sparse autoencoder (k = 70 and k = 15), applied to the MNIST data. This figure contrasts the sparsity achieved by the k-sparse autoencoder with that of other algorithms.

2.4.4 Unsupervised Feature Learning Results

In order to compare the quality of the features learnt by our algorithm with those learnt by other unsupervised learning methods, we first extracted features using each unsupervised learning algorithm. Then we fixed the features and trained a logistic regression classifier using those features. The usefulness of the features is then evaluated by examining the error rate of the classifier. We trained a number of architectures on the MNIST and NORB datasets, including RBMs, dropout autoencoders and denoising autoencoders.

Figure 2.1: Filters of the k-sparse autoencoder for different sparsity levels k (70, 40, 25 and 10), learnt from MNIST with 1000 hidden units.

Figure 2.2: Filters of the k-sparse autoencoder for different sparsity levels k (200, 150 and 50), learnt from NORB with 4000 hidden units.

Figure 2.3: Filters of k-sparse autoencoder with 1000 hidden units and k = 50, learnt from CIFAR-10 random patches.

In dropout, after finding the features using dropout regularization with a dropout rate of 50%, we used all of the hidden units as the features (this worked best). For the denoising autoencoder, after training the network by dropping the input pixels at a rate of 20%, we used all of the uncorrupted input pixels to find the features for classification (this worked best). For the k-sparse autoencoder, after training the dictionary, we used the support set supp_{αk}(W^⊤x + b) to find the features, as explained in Section 2.2.2, where α was determined using validation data. Results for the different architectures are compared in Table 2.1 and Table 2.2. We can see that the performance of our k-sparse autoencoder is better than that of the other algorithms. For our algorithm, the best result is achieved by k = 25, α = 3 with 1000 hidden units on the MNIST dataset, and by k = 150, α = 2 with 4000 hidden units on the NORB dataset.

Figure 2.4: Log histogram of hidden unit activities for various unsupervised learning methods: a ReLU autoencoder, a dropout autoencoder (50% hidden and 20% input dropout), and k-sparse autoencoders with k = 70 and k = 15.

Method                                                         Error Rate
Raw Pixels                                                     7.20%
RBM                                                            1.81%
Dropout Autoencoder (50% hidden)                               1.80%
Denoising Autoencoder (20% input dropout)                      1.95%
Dropout + Denoising Autoencoder (20% input and 50% hidden)     1.60%
k-Sparse Autoencoder, k = 40                                   1.54%
k-Sparse Autoencoder, k = 25                                   1.35%
k-Sparse Autoencoder, k = 10                                   2.10%

Table 2.1: Performance of unsupervised learning methods (without fine-tuning) with 1000 hidden units on MNIST.

Method                                                         Error Rate
Raw Pixels                                                     23%
RBM (weight decay)                                             10.6%
Dropout Autoencoder                                            10.1%
Denoising Autoencoder (20% input dropout)                      9.5%
k-Sparse Autoencoder, k = 200                                  10.4%
k-Sparse Autoencoder, k = 150                                  8.6%
k-Sparse Autoencoder, k = 75                                   9.5%

Table 2.2: Performance of unsupervised learning methods (without fine-tuning) with 4000 hidden units on NORB.

2.4.5 Shallow Supervised Learning Results

In supervised learning, it is a common practice to use the encoder weights learnt by an unsupervised learning method to initialize the early layers of a multilayer discriminative model [69]. The back-propagation algorithm is then used to adjust the weights of the last hidden layer and also to fine-tune the weights in the previous layers. This procedure is often referred to as discriminative fine-tuning. In this section, we report results using unsupervised learning algorithms such as RBMs, DBNs [9], DBMs [9], the third-order RBM [29], dropout autoencoders, denoising autoencoders and k-sparse autoencoders to initialize a shallow discriminative neural network for the MNIST and NORB datasets. We used back-propagation to fine-tune the weights. The regularization method used in the fine-tuning stage of each algorithm is the same as the one used in training the corresponding unsupervised model. For instance, we fine-tuned the weights obtained from the dropout autoencoder with dropout regularization, and for the denoising autoencoder we fine-tuned the discriminative neural net by adding noise to the input. In a similar manner, in the fine-tuning stage of the k-sparse autoencoder, we used the αk largest hidden units in

                                                                      Error
Without Pre-Training                                                  1.60%
RBM + Fine Tuning                                                     1.24%
Shallow Dropout AE + Fine Tuning (50% hidden)                         1.05%
Denoising AE + Fine Tuning (20% input dropout)                        1.20%
Deep Dropout AE + Fine Tuning (Layer-wise pre-training, 50% hidden)   0.85%
k-Sparse AE + Fine Tuning (k = 25)                                    1.08%
Deep k-Sparse AE + Fine Tuning (Layer-wise pre-training)              0.97%

Table 2.3: Performance of supervised learning methods on MNIST. Pre-training was performed using the corresponding unsupervised learning algorithm with 1000 hidden units, and then the model was fine-tuned.

                                                                      Error
Without Pre-Training                                                  12.7%
DBN                                                                   8.3%
DBM                                                                   7.2%
Third-Order RBM                                                       6.5%
Shallow Dropout AE + Fine Tuning (50% hidden)                         8.2%
Shallow Denoising AE + Fine Tuning (20% input dropout)                7.9%
Deep Dropout AE + Fine Tuning (Layer-wise pre-training, 50% hidden)   7.0%
Shallow k-Sparse AE + Fine Tuning (k = 150)                           7.8%
Deep k-Sparse AE + Fine Tuning (k = 150, Layer-wise pre-training)     7.4%

Table 2.4: Performance of supervised learning methods on NORB. Pre-training was performed using the corresponding unsupervised learning algorithm with 4000 hidden units, and then the model was fine-tuned.

the corresponding discriminative neural network, as explained in Section 2.2.2. Table 2.3 and Table 2.4 report the error rates obtained by different methods.

2.4.6 Deep Supervised Learning Results

The k-sparse autoencoder can be used as a building block of a deep neural network, using greedy layer-wise pre-training [70]. We first train a shallow k-sparse autoencoder and obtain the hidden codes. We then fix the features and train another k-sparse autoencoder on top of them to obtain another set of hidden codes. Then we use the parameters of these autoencoders to initialize a discriminative neural network with two hidden layers. In the fine-tuning stage of the deep neural net, we first fix the parameters of the first and second layers and train a softmax classifier on top of the second layer. We then hold

the weights of the first layer fixed and train the second layer and the softmax jointly, using the initialization of the softmax that we found in the previous step. Finally, we jointly fine-tune all of the layers with the previous initialization. We have observed that this method of layer-wise fine-tuning can improve the classification performance compared to the case where we fine-tune all the layers at the same time. In all of the fine-tuning steps, we keep the αk largest hidden codes, where k = 25, α = 3 on MNIST and k = 150, α = 2 on NORB, in both hidden layers. Table 2.3 and Table 2.4 report the classification results of different deep supervised learning methods.

2.5 Conclusion

In this chapter, we proposed a very fast sparse coding method called the k-sparse autoencoder, which achieves exact sparsity in the hidden representation. The main message of this chapter is that we can use the resulting representations to achieve competitive classification results, solely by enforcing sparsity in the hidden units and without using any other nonlinearity or regularization. We also discussed how the k-sparse autoencoder could be used for pre-training shallow and deep supervised architectures.

Chapter 3

Winner-Take-All Autoencoders

3.1 Introduction

In this chapter, we exploit sparsity as a generic prior on the representations for unsupervised feature learning, and propose a winner-take-all method for learning hierarchical sparse representations with autoencoders. We first introduce fully-connected winner-take-all autoencoders which use mini-batch statistics to directly enforce a lifetime sparsity in the activations of the hidden units. We then propose the convolutional winner-take-all autoencoder which combines the benefits of convolutional architectures and autoencoders for learning shift-invariant/convolutional sparse representations. We describe a way to train convolutional autoencoders layer by layer, where in addition to lifetime sparsity, a spatial sparsity within each feature map is achieved using winner-take-all activation functions. We will show that winner-take-all autoencoders can be used to learn deep sparse representations from the MNIST, CIFAR-10, ImageNet, Street View House Numbers and Toronto Face datasets, and achieve competitive classification performance.

3.2 Fully-Connected Winner-Take-All Autoencoders

Training sparse autoencoders has been well studied in the literature. For example, in [15], a “lifetime sparsity” penalty function proportional to the KL divergence between the hidden unit marginals ρ̂ and the target sparsity probability ρ is added to the cost function: λ KL(ρ ‖ ρ̂). A major drawback of this approach is that it only works for certain target sparsities and it is often very difficult to find the right λ parameter that results in a properly trained sparse autoencoder. Also, the KL divergence was originally proposed for sigmoidal autoencoders, and it is not clear how it can be applied to ReLU autoencoders

where ρ̂ could be larger than one (in which case the KL divergence cannot be evaluated). In this chapter, we propose Fully-Connected Winner-Take-All (FC-WTA) autoencoders to address these concerns. FC-WTA autoencoders can aim for any target sparsity rate, train very fast (marginally slower than a standard autoencoder), have no hyper-parameter to be tuned (except the target sparsity rate) and efficiently train all the dictionary atoms even when very aggressive sparsity rates (e.g., 1%) are enforced.

Sparse coding algorithms typically comprise two steps: a highly non-linear sparse encoding operation that finds the “right” atoms in the dictionary, and a linear decoding stage that reconstructs the input with the selected atoms and updates the dictionary. The FC-WTA autoencoder is a non-symmetric autoencoder where the encoding stage is typically a stack of several ReLU layers and the decoder is just a linear layer. In the feedforward phase, after computing the hidden codes of the last layer of the encoder, rather than reconstructing the input from all of the hidden units, we impose a lifetime sparsity on each hidden unit by keeping the k percent largest activations of that hidden unit across the mini-batch samples and setting the rest of its activations to zero. In the backpropagation phase, we only backpropagate the error through the k percent non-zero activations. In other words, we are using the mini-batch statistics to approximate the statistics of the activation of a particular hidden unit across all the samples, and finding a hard threshold value for which we can achieve a k% lifetime sparsity rate. In this setting, the highly nonlinear encoder of the network (ReLUs followed by top-k sparsity) learns to do sparse encoding, and the decoder of the network reconstructs the input linearly. At test time, we turn off the sparsity constraint and the output of the deep ReLU network is the final representation of the input. In order to train a stacked FC-WTA autoencoder, we fix the weights and train another FC-WTA autoencoder on top of the fixed representation of the previous network.

The learnt dictionaries of FC-WTA autoencoders trained on the MNIST, CIFAR-10 and Toronto Face datasets are visualized in Figure 3.1 and Figure 3.2. For large sparsity levels, the algorithm tends to learn very local features that are too primitive to be used for classification (Figure 3.1a). As we decrease the sparsity level, the network learns more useful features (longer digit strokes) and achieves better classification (Figure 3.1b and Figure 3.1c). Nevertheless, forcing too much sparsity results in features that are too global and do not factor the input into parts (Figure 3.1d and Figure 3.1e). Section 3.4.1 reports the classification results.
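A minimal NumPy sketch of the lifetime-sparsity step described above is given below: for every hidden unit, only the k% largest activations across the mini-batch are kept and the rest are zeroed. The function name, array shapes and random data are illustrative assumptions, not the implementation used in this chapter.

```python
import numpy as np

def lifetime_sparsity(h, rate):
    """Apply winner-take-all lifetime sparsity.

    h:    (batch_size, n_hidden) ReLU activations of the last encoder layer
    rate: target lifetime sparsity in (0, 1], e.g. 0.05 for 5%
    For each hidden unit (column), keep the ceil(rate * batch_size) largest
    activations across the mini-batch and set the remaining ones to zero.
    Only the surviving activations would receive gradients in backprop.
    """
    batch_size = h.shape[0]
    k = max(1, int(np.ceil(rate * batch_size)))
    # per-unit threshold: the k-th largest activation in that column
    thresh = np.partition(h, batch_size - k, axis=0)[batch_size - k]
    return np.where(h >= thresh, h, 0.0)

# Example: 100-sample mini-batch, 1000 hidden units, 5% lifetime sparsity
rng = np.random.default_rng(0)
h = np.maximum(rng.standard_normal((100, 1000)), 0.0)   # ReLU codes
h_sparse = lifetime_sparsity(h, rate=0.05)
```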

Figure 3.1: Learnt dictionary (decoder) of the FC-WTA autoencoder with 1000 hidden units and different sparsity rates, trained on MNIST: (a) 10% sparsity, (b) 5% sparsity, (c) 3% sparsity, (d) 2% sparsity, (e) 1% sparsity.

(a) Toronto Face Dataset (48 × 48) (b) CIFAR-10 Patches (11 × 11) Figure 3.2: Dictionaries (decoder) of FC-WTA autoencoder with 256 hidden units and sparsity of 5%.

Winner-Take-All RBMs. Besides autoencoders, WTA activations can also be used in Restricted Boltzmann Machines (RBMs) to learn sparse representations. Suppose h and v denote the hidden and visible units of an RBM. For training WTA-RBMs, in the positive phase of the contrastive divergence, instead of sampling from P(h_i|v), we first keep the k% largest P(h_i|v) for each h_i across the mini-batch dimension and set the rest of the P(h_i|v) values to zero, and then sample h_i according to the sparsified P(h_i|v). Filters of a WTA-RBM trained on MNIST are visualized in Figure 3.3. We can see that WTA-RBMs learn longer digit strokes on MNIST, which, as will be shown in Section 3.4.1, improves the classification rate. Note that the sparsity rate of WTA-RBMs (e.g., 30%) should not be as aggressive as that of WTA autoencoders (e.g., 5%), since RBMs are already regularized by having binary hidden states. A related method to the winner-take-all RBM is the Cardinality RBM [62], which achieves exact sparsity in RBMs using a dynamic programming algorithm.
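The sparsified positive phase can be sketched as follows; p_h holds the conditional probabilities P(h_i|v) for a mini-batch, and the shapes, the 30% rate and the Bernoulli sampling step are illustrative assumptions rather than the actual implementation.

```python
import numpy as np

def wta_rbm_positive_phase(p_h, rate, rng):
    """Sparsify P(h_i|v) across the mini-batch, then sample the hidden units.

    p_h:  (batch_size, n_hidden) matrix of P(h_i = 1 | v) for a mini-batch
    rate: lifetime sparsity rate, e.g. 0.3 keeps the 30% largest probabilities
    """
    batch_size = p_h.shape[0]
    k = max(1, int(np.ceil(rate * batch_size)))
    thresh = np.partition(p_h, batch_size - k, axis=0)[batch_size - k]
    p_sparse = np.where(p_h >= thresh, p_h, 0.0)
    # sample h_i ~ Bernoulli(sparsified probability)
    return (rng.random(p_sparse.shape) < p_sparse).astype(float)

rng = np.random.default_rng(0)
p_h = rng.random((100, 256))            # P(h_i|v) for a 100-sample mini-batch
h_sample = wta_rbm_positive_phase(p_h, rate=0.3, rng=rng)
```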

3.3 Convolutional Winner-Take-All Autoencoders

There are several problems with applying conventional sparse coding methods on large images. First, it is not practical to directly apply a fully-connected sparse coding algorithm on high-resolution (e.g., 256 × 256) images. Second, even if we could do that, we would learn a very redundant dictionary whose atoms are just shifted copies of each other. For example, in Figure 3.2a, the FC-WTA autoencoder has allocated different filters for the same patterns (i.e., mouths/noses/glasses/face borders) occurring at different locations. One way to address this problem is to extract random image patches from

Figure 3.3: Features learned on MNIST by RBMs with 256 hidden units: (a) standard RBM, (b) WTA-RBM (sparsity of 30%).

input images and then train an unsupervised learning algorithm on these patches in isolation [60]. Once training is complete, the filters can be used in a convolutional fashion to obtain representations of images. As discussed in [60, 71], the main problem with this approach is that if the receptive field is small, this method will not capture relevant features (imagine the extreme of 1 × 1 patches). Increasing the receptive field size is problematic, because then a very large number of features are needed to account for all the position-specific variations within the receptive field. For example, we see that in Figure 3.2b, the FC-WTA autoencoder allocates different filters to represent the same horizontal edge appearing at different locations within the receptive field. As a result, the learnt features are essentially shifted versions of each other, which results in redundancy between filters.

Unsupervised methods that make use of convolutional architectures can be used to address this problem, including convolutional RBMs [72], convolutional DBNs [73, 72], deconvolutional networks [74] and convolutional predictive sparse decomposition (PSD) [71, 7]. These methods learn features from the entire image in a convolutional fashion. In this setting, the filters can focus on learning the shapes (i.e., “what”), because the location information (i.e., “where”) is encoded into the feature maps, and thus the redundancy among the filters is reduced.

In this section, we propose Convolutional Winner-Take-All (CONV-WTA) autoencoders that learn to do shift-invariant/convolutional sparse coding by directly enforcing winner-take-all spatial and lifetime sparsity constraints. Our work is similar in spirit to deconvolutional networks [74] and convolutional PSD [71, 7]; however, whereas the approach in those works is to break apart the recognition pathway and the data generation pathway but learn them so that they are consistent, we describe a technique for directly learning a sparse convolutional autoencoder.

A shallow convolutional autoencoder maps an input vector to a set of feature maps in a convolutional fashion. We assume that the boundaries of the input image are zero-padded, so that each feature map has the same size as the input. The hidden representation is then mapped linearly to the output using a deconvolution operation (Appendix 3.7.1). The parameters are optimized to minimize the mean square error. A non-regularized convolutional autoencoder learns useless delta function filters that copy the input image to the feature maps and copy the feature maps back to the output. Interestingly, we have observed that even in the presence of denoising [16] or dropout [5] regularization, convolutional autoencoders still learn useless delta functions. Figure 3.4a depicts the filters of a convolutional autoencoder with 16 maps, 20% input dropout and 50% hidden unit dropout, trained on the Street View House Numbers dataset [75]. We see that the 16 learnt delta functions make 16 copies of the input pixels, so even if half of the hidden units get dropped during training, the network can still rely on the non-dropped copies to reconstruct the input. This highlights the need for new and more aggressive regularization techniques for convolutional autoencoders.

Figure 3.4: (a) Filters and feature maps of a denoising/dropout convolutional autoencoder, which learns useless delta functions. (b) Proposed architecture for the CONV-WTA autoencoder with spatial sparsity (128conv5-128conv5-128deconv11).

The proposed architecture for the CONV-WTA autoencoder is depicted in Figure 3.4b. The CONV-WTA autoencoder is a non-symmetric autoencoder where the encoder typically consists of a stack of several ReLU convolutional layers (e.g., 5 × 5 filters) and the decoder is a linear deconvolutional layer of larger size (e.g., 11 × 11 filters). We chose to use a deep encoder with smaller filters (e.g., 5 × 5) instead of a shallow one with larger filters (e.g., 11 × 11), because the former introduces more non-linearity and regularizes the network by forcing it to have a decomposition over large receptive fields through smaller filters. The CONV-WTA autoencoder is trained under two winner-take-all sparsity constraints: spatial sparsity and lifetime sparsity.

3.3.1 Spatial Sparsity

In the feedforward phase, after computing the last feature maps of the encoder, rather than reconstructing the input from all of the hidden units of the feature maps, we identify the single largest hidden activity within each feature map, and set the rest of the activities as well as their derivatives to zero. This results in a sparse representation whose sparsity level is the number of feature maps. The decoder then reconstructs the output using only the active hidden units in the feature maps, and the reconstruction error is only backpropagated through these hidden units as well. Consistent with other representation learning approaches such as triangle k-means [60] and deconvolutional networks [74, 76], we observed that using a softer sparsity constraint at test time results in a better classification performance.

Figure 3.5: The CONV-WTA autoencoder with 16 first-layer filters and 128 second-layer filters trained on MNIST: (a) input image; (b) learnt dictionary (deconvolution filters); (c) 16 feature maps while training (spatial sparsity applied); (d) 16 feature maps after training (spatial sparsity turned off); (e) 16 feature maps of the first layer after applying local max-pooling; (f) 48 out of 128 feature maps of the second layer after turning off the sparsity and applying local max-pooling (final representation).

So, in the CONV-WTA autoencoder, in order to find the final representation of the input image, we simply turn off the sparsity regularizer and use ReLU convolutions to compute the last-layer feature maps of the encoder. After that, we apply max-pooling (e.g., over 4 × 4 regions) on these feature maps and use this representation for classification tasks or for training stacked CONV-WTA autoencoders, as will be discussed in Section 3.3.3. Figure 3.5 shows a CONV-WTA autoencoder that was trained on MNIST.
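A minimal NumPy sketch of the training-time spatial-sparsity step is shown below; the (batch, maps, height, width) tensor layout, function name and random data are illustrative assumptions, not the actual implementation used in this chapter.

```python
import numpy as np

def spatial_sparsity(fmaps):
    """Keep only the single largest activation within each feature map.

    fmaps: (batch, n_maps, height, width) ReLU feature maps of the encoder.
    Returns a tensor of the same shape in which, for every (image, map) pair,
    all activations except the maximum one are set to zero; only these
    winning units would receive gradients during backpropagation.
    """
    b, m, h, w = fmaps.shape
    flat = fmaps.reshape(b, m, h * w)
    winners = flat.argmax(axis=2)[:, :, None]      # location of the max per map
    mask = np.zeros_like(flat, dtype=bool)
    np.put_along_axis(mask, winners, True, axis=2)
    return np.where(mask, flat, 0.0).reshape(b, m, h, w)

# Example: 100 images, 16 feature maps of size 28 x 28
rng = np.random.default_rng(0)
maps = np.maximum(rng.standard_normal((100, 16, 28, 28)), 0.0)
sparse_maps = spatial_sparsity(maps)
```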

3.3.2 Lifetime Sparsity

Although spatial sparsity is very effective in regularizing the autoencoder, it requires all the dictionary atoms to contribute to the reconstruction of every image. We can further increase the sparsity by exploiting winner-take-all lifetime sparsity as follows. Suppose we have 128 feature maps and the mini-batch size is 100. After applying spatial sparsity, for each filter we will have 100 “winner” hidden units corresponding to the 100 mini-batch images. During the feedforward phase, for each filter, we only keep the k% largest of these 100 values and set the rest of the activations to zero. Note that despite this aggressive sparsity, every filter is forced to get updated upon visiting every mini-batch, which is crucial for avoiding the dead filter problem that often occurs in sparse coding.

Figure 3.6 and Figure 3.7 show the effect of the lifetime sparsity on the dictionaries trained on MNIST and the Toronto Face dataset. We see that, similar to the FC-WTA autoencoders, by tuning the lifetime sparsity of CONV-WTA autoencoders we can aim for different sparsity rates. If no lifetime sparsity is enforced, we learn local filters that contribute to every training point (Figure 3.6a and Figure 3.7a). As we increase the lifetime sparsity, we can learn rare but useful features that result in better classification (Figure 3.6b). Nevertheless, forcing too much lifetime sparsity will result in features that are too diverse and rare and do not properly factor the input into parts (Figure 3.6c and Figure 3.7b).

Figure 3.6: Learnt dictionary (deconvolution filters) of the CONV-WTA autoencoder trained on MNIST (64conv5-64conv5-64conv5-64deconv11): (a) spatial sparsity only, (b) spatial and lifetime sparsity of 20%, (c) spatial and lifetime sparsity of 5%.

Figure 3.7: Learnt dictionary (deconvolution filters) of the CONV-WTA autoencoder trained on the Toronto Face dataset (64conv7-64conv7-64conv7-64deconv15): (a) spatial sparsity only, (b) spatial and lifetime sparsity of 10%.
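The combination of the spatial sparsity of Section 3.3.1 with the lifetime sparsity described above can be sketched as follows. This is a sketch under assumed tensor layout and function names, not the actual implementation used in this chapter.

```python
import numpy as np

def conv_wta_sparsity(fmaps, lifetime_rate):
    """Spatial sparsity followed by lifetime sparsity for a CONV-WTA encoder.

    fmaps: (batch, n_maps, height, width) ReLU feature maps.
    lifetime_rate: fraction of per-image winners kept for each filter,
                   e.g. 0.2 keeps the 20% largest winners across the batch.
    """
    b, m, h, w = fmaps.shape
    flat = fmaps.reshape(b, m, h * w)

    # spatial sparsity: one winner per (image, feature map)
    winners = flat.argmax(axis=2)[:, :, None]
    mask = np.zeros_like(flat, dtype=bool)
    np.put_along_axis(mask, winners, True, axis=2)
    spatial = np.where(mask, flat, 0.0)

    # lifetime sparsity: for each filter, keep the k% largest winners
    winner_vals = spatial.max(axis=2)                           # (batch, n_maps)
    k = max(1, int(np.ceil(lifetime_rate * b)))
    thresh = np.partition(winner_vals, b - k, axis=0)[b - k]    # per-filter threshold
    keep = (winner_vals >= thresh)[:, :, None]                  # broadcast over space
    return np.where(keep, spatial, 0.0).reshape(b, m, h, w)

# Example: 100 images, 128 feature maps, 20% lifetime sparsity
rng = np.random.default_rng(0)
maps = np.maximum(rng.standard_normal((100, 128, 28, 28)), 0.0)
out = conv_wta_sparsity(maps, lifetime_rate=0.2)
```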

3.3.3 Stacked CONV-WTA Autoencoders

The CONV-WTA autoencoder can be used as a building block to form a hierarchy. In order to train the hierarchical model, we first train a CONV-WTA autoencoder on the input images. Then we pass all the training examples through the network and obtain their representations (last layer of the encoder after turning off sparsity and applying local max-pooling). Now we treat these representations as a new dataset and train another CONV-WTA autoencoder to obtain the stacked representations. Figure 3.5f shows the deep feature maps of a stacked CONV-WTA autoencoder that was trained on MNIST.

3.3.4 Scaling CONV-WTA Autoencoders to Large Images

The goal of convolutional sparse coding is to learn shift-invariant dictionary atoms and encoding filters. Once the filters are learnt, they can be applied convolutionally to any image of any size, producing a spatial map corresponding to different locations of the input.

Figure 3.8: Learnt dictionary (deconvolution filters) of the CONV-WTA autoencoder trained on ImageNet 48 × 48 whitened patches (64conv5-64conv5-64conv5-64deconv11): (a) spatial sparsity only, (b) spatial and lifetime sparsity of 10%.

We can use this idea to efficiently train CONV-WTA autoencoders on datasets containing large images. Suppose we want to train an AlexNet [1] architecture in an unsupervised fashion on ImageNet, ILSVRC-2012 (224 × 224). In order to learn the first-layer 11 × 11 shift-invariant filters, we can extract medium-size image patches of size 48 × 48 and train a CONV-WTA autoencoder with 64 dictionary atoms of size 11 on these patches. This will result in 64 shift-invariant filters of size 11 × 11 that can efficiently capture the statistics of 48 × 48 patches. Once the filters are learnt, we can apply them in a convolutional fashion with a stride of 4 to the entire image, and after max-pooling we will have a 64 × 27 × 27 representation of the image. Now we can train another CONV-WTA autoencoder on top of these feature maps to capture the statistics of a larger receptive field at different locations of the input image. This process could be repeated for multiple layers. Figure 3.8 shows the dictionary learnt on ImageNet using this approach. We can see that by imposing lifetime sparsity, we could learn very diverse filters such as corner, circular and blob detectors.
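As a rough illustration of the arithmetic involved, the sketch below computes the spatial sizes when the learnt 11 × 11 filters are applied to a full 224 × 224 image with a stride of 4 and then max-pooled; the padding of 2 and the 3 × 3 pooling with stride 2 are assumptions in the spirit of [1], not values specified in this chapter.

```python
def conv_out(size, kernel, stride, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

# 11x11 filters learnt on 48x48 patches, applied convolutionally to 224x224 images
after_conv = conv_out(224, kernel=11, stride=4, pad=2)   # assumed padding of 2
after_pool = conv_out(after_conv, kernel=3, stride=2)    # assumed 3x3 pooling, stride 2
print(after_conv, after_pool)   # 55, 27  ->  a 64 x 27 x 27 representation
```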

3.4 Experiments

In all the experiments of this section, we evaluate the quality of the unsupervised features of WTA autoencoders by training a naive linear classifier (i.e., an SVM) on top of them. We did not fine-tune the filters in any of the experiments. The details of all the experiments are provided in Appendix 3.7.2.

3.4.1 Winner-Take-All Autoencoders on MNIST

The MNIST dataset has 60K training points and 10K test points. Table 3.1 compares the performance of FC-WTA autoencoders and WTA-RBMs with other permutation-invariant architectures. Table 3.2 compares the performance of the CONV-WTA autoencoder with other convolutional architectures. In these experiments, we have used all the available training labels (N = 60000 points) to train a linear SVM on top of the unsupervised features.

                                                                        Error Rate
Shallow Denoising/Dropout Autoencoder (20% input, 50% hidden dropout)   1.60%
Stacked Denoising Autoencoder (3 layers) [16]                           1.28%
Deep Boltzmann Machines [8]                                             0.95%
k-Sparse Autoencoder [57]                                               1.35%
Shallow FC-WTA Autoencoder, 2000 units, 5% sparsity                     1.20%
Stacked FC-WTA Autoencoder, 5% and 2% sparsity                          1.11%
Restricted Boltzmann Machines                                           1.60%
Winner-Take-All Restricted Boltzmann Machines (30% sparsity)            1.38%

Table 3.1: Classification performance of FC-WTA autoencoder features + SVM on MNIST.

An advantage of unsupervised learning algorithms is the ability to use them in semi-supervised scenarios where labeled data is limited. Table 3.3 shows the semi-supervised performance of the CONV-WTA autoencoder where we have assumed only N labels are available. In this case, the unsupervised features are still trained on the whole dataset (60K points), but the SVM is trained only on the N labeled points.

                                                   Error
Deep Deconvolutional Network [74, 76]              0.84%
Convolutional Deep Belief Network [72]             0.82%
Scattering Convolution Network [77]                0.43%
Convolutional Kernel Network [78]                  0.39%
CONV-WTA Autoencoder, 16 maps                      1.02%
CONV-WTA Autoencoder, 128 maps                     0.64%
Stacked CONV-WTA Autoencoder, 128 & 2048 maps      0.48%

Table 3.2: Classification performance of the CONV-WTA autoencoder trained on MNIST. Unsupervised features + SVM trained on N = 60000 labels (no fine-tuning).

                                       N=300   N=600   N=1K    N=2K    N=5K    N=10K   N=60K
Supervised Conv. Network [79]          7.18%   5.28%   3.21%   2.53%   1.52%   0.85%   0.53%
Convolutional Kernel Network [78]      4.15%   -       2.05%   1.51%   1.21%   0.88%   0.39%
Scattering Convolution Network [77]    4.70%   -       2.30%   1.30%   1.03%   0.88%   0.43%
CONV-WTA Autoencoder                   3.47%   2.37%   1.92%   1.45%   1.07%   0.91%   0.48%

Table 3.3: Classification performance of the CONV-WTA autoencoder trained on MNIST. Unsupervised features + SVM trained on few labels N (semi-supervised).

                                                                   Accuracy
Convolutional Triangle k-means [75]                                90.6%
CONV-WTA Autoencoder, 256 maps (N=600K)                            88.5%
Stacked CONV-WTA Autoencoder, 256 and 1024 maps (N=600K)           93.1%
Deep Variational Autoencoders (non-convolutional) [80] (N=1000)    63.9%
Stacked CONV-WTA Autoencoder, 256 and 1024 maps (N=1000)           76.2%
Supervised Maxout Network [6] (N=600K)                             97.5%

Table 3.4: Unsupervised features of the CONV-WTA autoencoder + SVM trained on N labeled points of the SVHN dataset.

The number of labeled points N varies from 300 to 60K. We compare this with the performance of a supervised deep convnet (CNN) [79] trained only on the N labeled training points. We can see that supervised deep learning techniques fail to learn good representations when labeled data is limited, whereas our WTA algorithm can extract useful features from the unlabeled data and achieve better classification. We also compare our method with some of the best semi-supervised learning results recently obtained by convolutional kernel networks (CKN) [78] and convolutional scattering networks (SC) [77]. We see that CONV-WTA outperforms both of these methods when very few labels are available (N < 1K).

3.4.2 CONV-WTA Autoencoder on Street View House Numbers

The SVHN dataset has about 600K training points and 26K test points. We first applied global contrast normalization to the images and then used local contrast normalization with a Gaussian kernel to preprocess each channel of the images. This is the same preprocessing that is used in [6]. The contrast-normalized SVHN images are shown in Figure 3.9a. Table 3.4 reports the classification results of the CONV-WTA autoencoder on this dataset. We first trained a shallow and a stacked CONV-WTA autoencoder on all 600K training cases to learn the unsupervised features, and then performed two sets of experiments.

Figure 3.9: The CONV-WTA autoencoder trained on the Street View House Numbers (SVHN) dataset: (a) contrast-normalized SVHN images, (b) learnt dictionary (64conv5-64conv5-64conv5-64deconv11).

In the first experiment, we used all of the N = 600K available labels to train an SVM on top of the CONV-WTA autoencoder features, and compared the result with convolutional k-means [75]. We see that the stacked CONV-WTA autoencoder achieves a dramatic improvement over the shallow CONV-WTA autoencoder as well as k-means. In the second experiment, we trained an SVM using only N = 1000 labeled data points and compared the result with deep variational autoencoders [80] trained in the same semi-supervised fashion. Figure 3.9b shows the learnt dictionary of the CONV-WTA autoencoder on this dataset.

3.4.3 CONV-WTA Autoencoder on CIFAR-10

Table 3.5 reports the classification results of CONV-WTA autoencoders on the CIFAR-10 dataset. We see that when a small number of feature maps (< 256) is used, considerable improvements over k-means can be achieved. This is because our method can learn a shift-invariant dictionary, as opposed to the redundant dictionaries learnt by patch-based methods such as k-means. In the largest deep network that we trained, we used 256, 1024 and 4096 maps and achieved a classification rate of 80.1% without using fine-tuning, model averaging or data augmentation. Figure 3.10 shows the learnt dictionary on the CIFAR-10 dataset. We can see that the network has learnt diverse shift-invariant filters such as point/corner detectors, as opposed to Figure 3.2b, which shows the position-specific filters of patch-based methods.

3.5 Discussion

Relationship of FC-WTA autoencoders to k-sparse autoencoders. k-sparse autoencoders impose sparsity across different channels (population sparsity), whereas FC-WTA autoencoders impose sparsity across training examples (lifetime sparsity).

Figure 3.10: CONV-WTA autoencoder trained on the CIFAR-10 dataset.

                                                           Accuracy
Shallow Convolutional Triangle k-means (64 maps) [60]      62.3%
Shallow CONV-WTA Autoencoder (64 maps)                     68.9%
Shallow Convolutional Triangle k-means (256 maps) [60]     70.2%
Shallow CONV-WTA Autoencoder (256 maps)                    72.3%
Shallow Convolutional Triangle k-means (4000 maps) [60]    79.6%
Deep Triangle k-means (1600, 3200, 3200 maps) [81]         82.0%
Convolutional Deep Belief Net (2 layers) [73]              78.9%
Exemplar CNN (300x Data Augmentation) [82]                 82.0%
NOMP (3200, 6400, 6400 maps + Averaging 7 Models) [83]     82.9%
Stacked CONV-WTA (256, 1024 maps)                          77.9%
Stacked CONV-WTA (256, 1024, 4096 maps)                    80.1%
Supervised Maxout Network [6]                              88.3%

Table 3.5: Unsupervised features + SVM (without fine-tuning)

When aiming for low sparsity levels, k-sparse autoencoders use a scheduling technique to avoid the dead dictionary atom problem. WTA autoencoders, however, do not have this problem, since all the hidden units get updated upon visiting every mini-batch, no matter how aggressive the sparsity rate is (no scheduling required). As a result, we can train larger networks and achieve better classification rates.

Relationship of CONV-WTA autoencoders to deconvolutional networks and convolutional PSD. Deconvolutional networks [74, 76] are top-down models with no direct link from the image to the feature maps. Inferring the sparse maps requires running the iterative ISTA algorithm, which is costly. Convolutional PSD [71] addresses this problem by training a parameterized encoder separately to explicitly predict the sparse codes using a soft thresholding operator. Deconvolutional networks and convolutional PSD can be viewed as the generative decoder and encoder paths of a convolutional autoencoder. Our contribution is to propose a specific winner-take-all approach for training a convolutional autoencoder, in which both paths are trained jointly using direct backpropagation, yielding an algorithm that is much faster, easier to implement and able to train much larger networks.

Relationship to maxout networks. Maxout networks [6] take the max across different channels, whereas our method takes the max across space and mini-batch dimensions. Also, the winner-take-all feature maps retain the location information of the “winners” within each feature map, and different locations have different connectivity to the subsequent layers, whereas the maxout activity is passed to the next layer using weights that are the

same regardless of which unit gave the maximum.

3.6 Conclusion

We proposed the winner-take-all spatial and lifetime sparsity methods to train autoencoders that learn to do fully-connected and convolutional sparse coding. We observed that CONV-WTA autoencoders learn shift-invariant and diverse dictionary atoms as opposed to position-specific Gabor-like atoms that are typically learnt by conventional sparse coding methods. Unlike related approaches, such as deconvolutional networks and convolutional PSD, our method jointly trains the encoder and decoder paths by direct back-propagation, and does not require an iterative EM-like optimization technique during training. We described how our method can be scaled to large datasets such as ImageNet and showed the necessity of the deep architecture to achieve better results. We performed experiments on the MNIST, SVHN and CIFAR-10 datasets and showed that the classification rates of winner-take-all autoencoders are competitive with the state-of-the-art.

3.7 Appendix

3.7.1 Implementation Details

In this section, we describe the network architectures and hyper-parameters that were used in the experiments. While most conventional sparse coding algorithms require complex matrix operations such as matrix inversion or singular value decomposition (SVD), WTA autoencoders only require the sort operation in addition to matrix multiplication and convolution, which are all efficiently implemented in most GPU libraries. We used Alex Krizhevsky’s cuda-convnet convolution kernels [1] for this work.

Deconvolution Kernels

At the decoder of a convolutional autoencoder, deconvolutional layers are used. The deconvolution operation is exactly the reverse of convolution (i.e., its forward pass is the backward pass of convolution). For example, whereas a strided convolution decreases the feature map size, a strided deconvolution increases the map size. We implemented the deconvolution kernels by minor modifications of current available GPU kernels for the convolution operation.
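As a rough 1-D illustration of this relationship (a sketch only, not the modified GPU kernels used in this work), the forward pass of the strided deconvolution below is exactly the backward pass of a strided convolution, and therefore increases the signal length:

```python
import numpy as np

def deconv1d(x, kernel, stride=1):
    """Naive 1-D transposed convolution ("deconvolution").

    Each input value scatters a scaled copy of the kernel into the output,
    which is the gradient (backward) pass of a strided 1-D convolution.
    A strided deconvolution therefore increases the signal length.
    """
    out_len = (len(x) - 1) * stride + len(kernel)
    out = np.zeros(out_len)
    for i, v in enumerate(x):
        out[i * stride : i * stride + len(kernel)] += v * kernel
    return out

# a length-4 map with stride 2 and a length-5 kernel maps to a length-11 output
print(deconv1d(np.ones(4), np.ones(5), stride=2).shape)   # (11,)
```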

Effect of Tied Weights

We found that tying the encoder and decoder weights of FC-WTA autoencoders helps their generalization performance. However, tying the convolution and deconvolution weights of CONV-WTA autoencoders hurts the generalization performance (data not shown). We think this is because the CONV-WTA autoencoder is already heavily regularized by the aggressive sparsity constraints, and tying the weights results in too much regularization.

3.7.2 Experiment Details

CONV-WTA Autoencoder on MNIST

On the MNIST dataset, we trained two networks.
Shallow CONV-WTA Autoencoder (128 maps). In the shallow architecture, we used 128 filters with a 7 × 7 receptive field applied at strides of 1 pixel. After training, we used max-pooling over 5 × 5 regions at strides of 3 pixels to obtain the final 128 × 10 × 10 representation. SVM was then applied to this representation for classification.
Stacked CONV-WTA Autoencoder (128, 2048 maps). In the deep architecture, we trained another 2048 feature maps on top of the pooled feature maps of the first network, with a filter width of 3 applied at strides of 1 pixel. After training, we used max-pooling over 3 × 3 regions at strides of 2 pixels to obtain the final 2048 × 5 × 5 representation. SVM was then applied to this representation for classification.

Semi-Supervised CONV-WTA Autoencoder. In the semi-supervised setup, the amount of labeled data was varied from N = 300 to N = 60000. We ensured the dataset is balanced and each class has the same number of labeled points in all the experiments. We used the stacked CONV-WTA autoencoder (128, 2048 maps) trained in the previous part, and trained an SVM on top of the unsupervised features using only the N labeled data points.

CONV-WTA Autoencoder on SVHN

The Street View House Numbers (SVHN) dataset consists of about 600,000 images (both the difficult and the simple sets) and 26,000 test images. We first applied global contrast normalization to the images and then used local contrast normalization with a Gaussian kernel to preprocess each channel of the images. This is the same preprocessing that is used in [6]. The contrast-normalized SVHN images are shown in Figure 3.9a. We trained two networks on this dataset.
CONV-WTA Autoencoder (256 maps). The architecture used for this network is 256conv3-256conv3-256conv3-256deconv7. After training, we used max-pooling on the last 256 feature maps of the encoder, over 6 × 6 regions at strides of 4 pixels, to obtain the final 256 × 8 × 8 representation. SVM was then applied to this representation for classification. We observed that having a stack of conv3 layers instead of a single 256conv7 encoder significantly improved the classification rate.
Stacked CONV-WTA Autoencoder (256, 1024 maps). In the stacked architecture, we trained another 1024 feature maps on top of the pooled feature maps of the first network, with a filter width of 3 applied at strides of 1 pixel. After training, we used max-pooling over 3 × 3 regions at strides of 2 pixels to obtain the final 1024 × 4 × 4 representation. SVM was then applied to this representation for classification.
Semi-Supervised CONV-WTA Autoencoder. In the semi-supervised setup, we assumed only N = 1000 labeled data points are available. We used the stacked CONV-WTA autoencoder (256, 1024 maps) trained in the previous part, and trained an SVM on top of the unsupervised features using only the N = 1000 labeled data points.

CONV-WTA Autoencoder on CIFAR-10

On the CIFAR-10 dataset, we used global contrast normalization followed by ZCA whitening with the regularization bias of 0.1 to preprocess the dataset. This is the same

preprocessing that is used in [60]. We trained three networks on CIFAR-10.
CONV-WTA Autoencoder (256 maps). The architecture used for this network is 256conv3-256conv3-256conv3-256deconv7. After training, we used max-pooling on the last 256 feature maps of the encoder, over 6 × 6 regions at strides of 4 pixels, to obtain the final 256 × 8 × 8 representation. SVM was then applied to this representation for classification.
Stacked CONV-WTA Autoencoder (256, 1024 maps). For this network, we trained another 1024 feature maps on top of the pooled feature maps of the first network, with a filter width of 3 applied at strides of 1 pixel. After training, we used max-pooling over 3 × 3 regions at strides of 2 pixels to obtain the final 1024 × 4 × 4 representation. SVM was then applied to this representation for classification.
Stacked CONV-WTA Autoencoder (256, 1024, 4096 maps). For this model, we first trained a CONV-WTA network with the architecture of 256conv3-256conv3-256conv3-256deconv7. After training, we used max-pooling on the last 256 feature maps of the encoder, over 3 × 3 regions at strides of 2 pixels, to obtain a 256 × 16 × 16 representation. We then trained another 1024 feature maps with a filter width of 3 and a stride of 1 on top of the pooled feature maps of the first layer. We then obtained the second-layer representation by max-pooling the 1024 feature maps with a pooling stride of 2 and width of 3 to obtain a 1024 × 8 × 8 representation. We then trained another 4096 feature maps with a filter width of 3 and a stride of 1 on top of the pooled feature maps of the second layer. Then we used max-pooling on the 4096 feature maps with a pooling width of 3 applied at strides of 2 pixels to obtain the final 4096 × 4 × 4 representation. An SVM was trained on top of the final representation for classification.

Chapter 4

Adversarial Autoencoders

4.1 Introduction

Building scalable generative models to capture rich distributions such as audio, images or video is one of the central challenges of machine learning. In this chapter, we study autoencoders from a probabilistic perspective and propose new regularization techniques that can turn autoencoders into generative models. In particular, we propose the adversarial autoencoder (AAE), which is a probabilistic autoencoder that uses the recently proposed generative adversarial networks (GAN) [11] to perform variational inference. In our model, an autoencoder is trained with dual objectives – a traditional reconstruction error criterion, and an adversarial training criterion [11] that matches the aggregated posterior distribution of the latent representation of the autoencoder to an arbitrary prior distribution. Matching the aggregated posterior to the prior ensures that generating from any part of prior space results in meaningful samples. We show that the training criterion of AAE has a strong connection to VAE training. The result of the training is that the encoder learns to convert the data distribution to the prior distribution, while the decoder learns a deep generative model that maps the imposed prior to the data distribution. We show how the adversarial autoencoder can be used in applications such as semi-supervised classification, disentangling style and content of images, unsupervised clustering, dimensionality reduction and data visualization. We perform experiments on MNIST, Street View House Numbers and Toronto Face datasets and show that adversarial autoencoders achieve competitive results in generative modeling and semi-supervised classification tasks.


Figure 4.1: Architecture of an adversarial autoencoder. The top row is a standard autoencoder that reconstructs an image x from a latent code z. The bottom row diagrams a second network trained to discriminatively predict whether a sample arises from the hidden code of the autoencoder or from a sampled distribution specified by the user.

4.2 Adversarial Autoencoders

Let x be the input and z be the latent code vector (hidden units) of an autoencoder with a deep encoder and decoder. Let p(z) be the prior distribution we want to impose on the codes, q(z|x) be an encoding distribution and p(x|z) be the decoding distribution.

Also let p_d(x) be the data distribution, and p(x) be the model distribution. The encoding function of the autoencoder, q(z|x), defines an aggregated posterior distribution q(z) on the hidden code vector of the autoencoder as follows:

q(z) = ∫_x q(z|x) p_d(x) dx    (4.1)
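In practice, Equation 4.1 is never evaluated in closed form; samples from q(z) are obtained by ancestral sampling, i.e., by encoding examples drawn from the training set. The sketch below illustrates this under the deterministic encoder described later in this section; the function name, the toy linear encoder and the array shapes are illustrative assumptions.

```python
import numpy as np

def sample_aggregated_posterior(encoder, data, batch_size, rng):
    """Draw samples from q(z) = ∫ q(z|x) p_d(x) dx by ancestral sampling:
    pick x ~ p_d(x) from the training set, then compute z = encoder(x)."""
    idx = rng.integers(0, data.shape[0], size=batch_size)
    return encoder(data[idx])          # deterministic q(z|x)

# Example with a random linear encoder on toy data
rng = np.random.default_rng(0)
data = rng.random((1000, 784))
W = rng.standard_normal((784, 8)) * 0.01
z_batch = sample_aggregated_posterior(lambda x: x @ W, data, batch_size=100, rng=rng)
```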

The adversarial autoencoder is an autoencoder that is regularized by matching the aggregated posterior, q(z), to an arbitrary prior, p(z). In order to do so, an adversarial network is attached on top of the hidden code vector of the autoencoder as illustrated in Figure 4.1. It is the adversarial network that guides q(z) to match p(z). The autoencoder, meanwhile, attempts to minimize the reconstruction error. The generator of the adversarial network is also the encoder of the autoencoder, q(z|x). The encoder ensures the aggregated posterior distribution can fool the discriminative adversarial network into thinking that the hidden code q(z) comes from the true prior distribution p(z). Both the adversarial network and the autoencoder are trained jointly with SGD in two phases – the reconstruction phase and the regularization phase – executed on each mini-batch. In the reconstruction phase, the autoencoder updates the encoder and the decoder to minimize the reconstruction error of the inputs.

In the regularization phase, the adversarial network first updates its discriminative network to tell apart the true samples (generated using the prior) from the generated samples (the hidden codes computed by the autoencoder). The adversarial network then updates its generator (which is also the encoder of the autoencoder) to confuse the discriminative network. Once the training procedure is done, the decoder of the autoencoder will define a generative model that maps the imposed prior p(z) to the data distribution.

There are several possible choices for the encoder, q(z|x), of adversarial autoencoders:

Deterministic: Here we assume that q(z|x) is a deterministic function of x. In this case, the encoder is similar to the encoder of a standard autoencoder and the only source of stochasticity in q(z) is the data distribution, p_d(x).

Gaussian posterior: Here we assume that q(z|x) is a Gaussian distribution whose mean and variance are predicted by the encoder network: z_i ∼ N(µ_i(x), σ_i(x)). In this case, the stochasticity in q(z) comes from both the data distribution and the randomness of the Gaussian distribution at the output of the encoder. We can use the same re-parametrization trick of [10] for back-propagation through the encoder network.

Universal approximator posterior: Adversarial autoencoders can be used to train q(z|x) as a universal approximator of the posterior. Suppose the encoder network of the adversarial autoencoder is a function f(x, η) that takes the input x and a random noise η with a fixed distribution (e.g., Gaussian). We can sample from an arbitrary posterior distribution q(z|x) by evaluating f(x, η) at different samples of η. In other words, we can assume q(z|x, η) = δ(z − f(x, η)), and the posterior q(z|x) and the aggregated posterior q(z) are defined as follows:

q(z|x) = ∫_η q(z|x, η) p_η(η) dη   ⇒   q(z) = ∫_x ∫_η q(z|x, η) p_d(x) p_η(η) dη dx    (4.2)

In this case, the stochasticity in q(z) comes from both the data distribution and the random noise η at the input of the encoder. Note that in this case the posterior q(z|x) is no longer constrained to be Gaussian and the encoder can learn any arbitrary posterior distribution for a given input x. Since there is an efficient method of sampling from the aggregated posterior q(z), the adversarial training procedure can match q(z) to p(z) by direct back-propagation through the encoder network f(x, η). Choosing different types of q(z|x) will result in different kinds of models with different training dynamics.

For example, in the deterministic case of q(z|x), the network has to match q(z) to p(z) by exploiting only the stochasticity of the data distribution; but since the empirical distribution of the data is fixed by the training set, and the mapping is deterministic, this might produce a q(z) that is not very smooth. However, in the Gaussian or universal approximator case, the network has access to additional sources of stochasticity that can help it in the adversarial regularization stage by smoothing out q(z). Nevertheless, after an extensive hyper-parameter search, we obtained similar test-likelihoods with each type of q(z|x). So in the rest of the chapter, we only report results with the deterministic version of q(z|x).
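The losses optimized in the two training phases described above can be sketched as follows. This is a minimal sketch with toy linear networks, a standard Gaussian prior and a sigmoid discriminator; the network shapes, the small-constant stabilizer and the omission of the actual gradient updates are assumptions made for illustration, not the architectures used in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W_e):      return x @ W_e                              # deterministic q(z|x)
def decoder(z, W_d):      return z @ W_d                              # p(x|z)
def discriminator(z, W_dis):                                          # D(z) in (0, 1)
    return 1.0 / (1.0 + np.exp(-(z @ W_dis)))

# toy shapes: 784-D inputs, 8-D hidden code
x     = rng.random((100, 784))
W_e   = rng.standard_normal((784, 8)) * 0.01
W_d   = rng.standard_normal((8, 784)) * 0.01
W_dis = rng.standard_normal((8, 1))   * 0.01

# --- reconstruction phase: the encoder and decoder are updated on this loss ---
z_fake   = encoder(x, W_e)
recon    = decoder(z_fake, W_d)
loss_rec = np.mean((recon - x) ** 2)

# --- regularization phase ---
z_real = rng.standard_normal(z_fake.shape)          # samples from the prior p(z)
d_real = discriminator(z_real, W_dis)
d_fake = discriminator(z_fake, W_dis)
# (i) discriminator loss: tell prior samples apart from hidden codes
loss_disc = -np.mean(np.log(d_real + 1e-8) + np.log(1.0 - d_fake + 1e-8))
# (ii) generator (= encoder) loss: fool the discriminator
loss_gen = -np.mean(np.log(d_fake + 1e-8))
# the SGD updates of W_e, W_d and W_dis on these losses are omitted here
```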

4.2.1 Relationship to Variational Autoencoders

Our work is similar in spirit to variational autoencoders [10]; however, while they use a KL divergence penalty to impose a prior distribution on the hidden code vector of the autoencoder, we use an adversarial training procedure to do so by matching the aggregated posterior of the hidden code vector with the prior distribution. VAE [10] minimizes the following upper-bound on the negative log-likelihood of x:

E_{x∼p_d(x)}[− log p(x)] < E_x[E_{q(z|x)}[− log p(x|z)]] + E_x[KL(q(z|x) ‖ p(z))]
    = E_x[E_{q(z|x)}[− log p(x|z)]] − E_x[H(q(z|x))] + E_{q(z)}[− log p(z)]
    = E_x[E_{q(z|x)}[− log p(x|z)]] − E_x[Σ_i log σ_i(x)] + E_{q(z)}[− log p(z)] + const.
    = Reconstruction − Entropy + CrossEntropy(q(z), p(z))    (4.3)

where the aggregated posterior q(z) is defined in Equation 4.1 and we have assumed q(z|x) is Gaussian and p(z) is an arbitrary distribution. The variational bound contains three terms. The first term can be viewed as the reconstruction term of an autoencoder, and the second and third terms can be viewed as regularization terms. Without the regularization terms, the model is simply a standard autoencoder that reconstructs the input. However, in the presence of the regularization terms, the VAE learns a latent representation that is compatible with p(z). The second term of the cost function encourages large variances for the posterior distribution, while the third term minimizes the cross-entropy between the aggregated posterior q(z) and the prior p(z). The KL divergence, or the cross-entropy term in Equation 4.3, encourages q(z) to pick the modes of p(z). In adversarial autoencoders, we replace the second two terms with an adversarial training procedure that encourages q(z) to match the whole distribution of p(z).

In this section, we compare the ability of the adversarial autoencoder and the VAE to impose a specified prior distribution p(z) on the coding distribution.


Figure 4.2: Comparison of the adversarial and variational autoencoder on MNIST. The hidden code z of the hold-out images for an adversarial autoencoder fit to (a) a 2-D Gaussian and (b) a mixture of 10 2-D Gaussians. Each color represents the associated label. Same for the variational autoencoder with (c) a 2-D Gaussian and (d) a mixture of 10 2-D Gaussians. (e) Images generated by uniformly sampling the Gaussian percentiles along each hidden code dimension z in the 2-D Gaussian adversarial autoencoder.

Figure 4.2a shows the coding space z of the test data resulting from an adversarial autoencoder trained on MNIST digits in which a spherical 2-D Gaussian prior distribution is imposed on the hidden codes z. The learned manifold in Figure 4.2a exhibits sharp transitions, indicating that the coding space is filled and exhibits no “holes”. In practice, sharp transitions in the coding space indicate that images generated by interpolating within z lie on the data manifold (Figure 4.2e). By contrast, Figure 4.2c shows the coding space of a VAE with the same architecture used in the adversarial autoencoder experiments. We can see that in this case the VAE roughly matches the shape of a 2-D Gaussian distribution. However, no data points map to several local regions of the coding space, indicating that the VAE may not have captured the data manifold as well as the adversarial autoencoder. Figure 4.2b and Figure 4.2d show the code space of an adversarial autoencoder and of a VAE where the imposed distribution is a mixture of 10 2-D Gaussians. The adversarial autoencoder successfully matched the aggregated posterior with the prior distribution (Figure 4.2b).

In contrast, the VAE exhibits systematic differences from the mixture of 10 Gaussians, indicating that the VAE emphasizes matching the modes of the distribution, as discussed above (Figure 4.2d).

An important difference between VAEs and adversarial autoencoders is that in VAEs, in order to back-propagate through the KL divergence by Monte-Carlo sampling, we need to have access to the exact functional form of the prior distribution. However, in AAEs, we only need to be able to sample from the prior distribution in order to induce q(z) to match p(z). In Section 4.2.3, we will demonstrate that the adversarial autoencoder can impose complicated distributions (e.g., swiss roll distribution) without having access to the explicit functional form of the distribution.

4.2.2 Relationship to GANs and GMMNs

In the original generative adversarial networks (GAN) paper [11], GANs were used to impose the data distribution at the pixel level on the output layer of a neural network. Adversarial autoencoders, however, rely on the autoencoder training to capture the data distribution. In the adversarial training procedure of our method, a much simpler distribution (e.g., a Gaussian, as opposed to the data distribution) is imposed in a much lower-dimensional space (e.g., 20 dimensions as opposed to 1000), which results in a better test-likelihood, as discussed in Section 4.3.

Generative moment matching networks (GMMN) [50] use the maximum mean discrepancy (MMD) objective to shape the distribution of the output layer of a neural network. The MMD objective can be interpreted as minimizing the distance between all moments of the model distribution and the data distribution. It has been shown that GMMNs can be combined with pre-trained dropout autoencoders to achieve better likelihood results (GMMN+AE). Our adversarial autoencoder also relies on the autoencoder to capture the data distribution. However, the main difference between our work and GMMN+AE is that the adversarial training procedure of our method acts as a regularizer that shapes the code distribution while training the autoencoder from scratch, whereas the GMMN+AE model first trains a standard dropout autoencoder and then fits a distribution in the code space of the pre-trained network. In Section 4.3, we will show that the test-likelihood achieved by the joint training scheme of adversarial autoencoders outperforms that of GMMN and GMMN+AE on the MNIST and Toronto Face datasets.

Figure 4.3: Regularizing the hidden code by providing a one-hot vector to the discriminative network. The one-hot vector has an extra label for training points with unknown classes.

4.2.3 Incorporating Label Information in the Adversarial Regularization

In the scenarios where data is labeled, we can incorporate the label information in the adversarial training stage to better shape the distribution of the hidden code. In this section, we describe how to leverage partial or complete label information to regularize the latent representation of the autoencoder more heavily. To demonstrate this architecture, we return to Figure 4.2b, in which the adversarial autoencoder is fit to a mixture of 10 2-D Gaussians. We now aim to force each mode of the mixture of Gaussians to represent a single label of MNIST. Figure 4.3 demonstrates the training procedure for this semi-supervised approach. We add a one-hot vector to the input of the discriminative network to associate the label with a mode of the distribution. The one-hot vector acts as a switch that selects the corresponding decision boundary of the discriminative network given the class label. This one-hot vector has an extra class for unlabeled examples. For example, in the case of imposing a mixture of 10 2-D Gaussians (Figure 4.2b and Figure 4.4a), the one-hot vector contains 11 classes. Each of the first 10 classes selects a decision boundary for the corresponding individual mixture component. The extra class in the one-hot vector corresponds to unlabeled training points. When an unlabeled point is presented to the model, the extra class is turned on to select the decision boundary for the full mixture of Gaussians. During the positive phase of adversarial training, we provide the label of the mixture component (that the positive sample is drawn from) to the discriminator through the one-hot vector. The positive samples fed for unlabeled examples come from the full mixture of Gaussians, rather than from a particular class. During the negative phase, we provide the label of the training point image to the discriminator through the one-hot vector.


Figure 4.4: Leveraging label information to better regularize the hidden code. Top Row: Training the coding space to match a mixture of 10 2-D Gaussians: (a) Coding space z of the hold-out images. (b) The manifold of the first 3 mixture components: each panel includes images generated by uniformly sampling the Gaussian percentiles along the axes of the corresponding mixture component. Bottom Row: Same but for a swiss roll distribution (see text). Note that labels are mapped in a numeric order (i.e., the first 10% of swiss roll is assigned to digit 0 and so on): (c) Coding space z of the hold-out images. (d) Samples generated by walking along the main swiss roll axis.

Figure 4.4a shows the latent representation of an adversarial autoencoder with a prior that is a mixture of 10 2-D Gaussians, trained on 10K labeled MNIST examples and 40K unlabeled MNIST examples. In this case, the i-th mixture component of the prior has been assigned to the i-th class in a semi-supervised fashion. Figure 4.4b shows the manifold of the first three mixture components. Note that the style representation is consistently represented within each mixture component, independent of its class. For example, the upper-left regions of all panels in Figure 4.4b correspond to the upright writing style and the lower-right regions of these panels correspond to the tilted writing style of digits. This method may be extended to arbitrary distributions with no parametric forms – as demonstrated by mapping the MNIST dataset onto a “swiss roll” (a conditional Gaussian distribution whose mean is uniformly distributed along the length of a swiss roll axis). Figure 4.4c depicts the coding space z and Figure 4.4d highlights the images

Figure 4.5: Samples generated from an adversarial autoencoder trained on (a) MNIST (8-D Gaussian prior) and (b) the Toronto Face dataset (TFD, 15-D Gaussian prior). The last column shows the closest training images in pixel-wise Euclidean distance to those in the second-to-last column.

generated by walking along the swiss roll axis in the latent space.
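A minimal sketch of how the one-hot class switch described in this section can be appended to the discriminator input is given below; the helper only builds the conditioning vector, and its name, the code dimensionality and the use of the 11-th slot for unlabeled points are illustrative assumptions rather than the actual implementation.

```python
import numpy as np

def discriminator_input(z, label=None, n_classes=10):
    """Concatenate the hidden code z with a one-hot class switch.

    z:     (dim,) hidden code or prior sample
    label: integer class in [0, n_classes) for labeled points, or None for
           unlabeled points (the extra (n_classes + 1)-th class is used).
    """
    one_hot = np.zeros(n_classes + 1)
    one_hot[label if label is not None else n_classes] = 1.0
    return np.concatenate([one_hot, z])

print(discriminator_input(np.zeros(2), label=3).shape)   # (13,): 11 switch dims + 2-D code
print(discriminator_input(np.zeros(2), label=None))      # unlabeled: extra class turned on
```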

4.3 Likelihood Analysis of Adversarial Autoencoders

The experiments presented in the previous sections have only demonstrated qualitative results. In this section we measure the ability of the AAE as a generative model to capture the data distribution by comparing the likelihood of this model to generate hold-out images on the MNIST and Toronto face dataset (TFD) using the evaluation procedure described in [11]. We trained an adversarial autoencoder on MNIST and TFD in which the model imposed a high-dimensional Gaussian distribution on the underlying hidden code. Figure 4.5 shows samples drawn from the adversarial autoencoder trained on these datasets. A video showing the learnt TFD manifold can be found at this link. To determine whether the model is over-fitting by copying the training data points, we used the last column of these figures to show the nearest neighbors, in Euclidean distance, to the generative model samples in the second-to-last column. We evaluate the performance of the adversarial autoencoder by computing its log- likelihood on the hold out test set. Evaluation of the model using likelihood is not straightforward because we can not directly compute the probability of an image. Thus, we calculate a lower bound of the true log-likelihood using the methods described in prior Chapter 4. Adversarial Autoencoders 53

                          MNIST (10K)   MNIST (10M)   TFD (10K)    TFD (10M)
DBN [35]                  138 ± 2       -             1909 ± 66    -
Stacked CAE [84]          121 ± 1.6     -             2110 ± 50    -
Deep GSN [85]             214 ± 1.1     -             1890 ± 29    -
GAN [11]                  225 ± 2       386           2057 ± 26    -
GMMN + AE [50]            282 ± 2       -             2204 ± 20    -
Adversarial Autoencoder   340 ± 2       427           2252 ± 16    2522

Table 4.1: Log-likelihood of test data on MNIST and the Toronto Face dataset. Higher values are better. On both datasets we report the Parzen window estimate of the log-likelihood obtained by drawing 10K or 10M samples from the trained model. For MNIST, we compare against other models on the real-valued version of the dataset.

We fit a Gaussian Parzen window (kernel density estimator) to 10,000 samples generated from the model and compute the likelihood of the test data under this distribution. The free parameter σ of the Parzen window is selected via cross-validation. Table 4.1 compares the log-likelihood of the adversarial autoencoder for real-valued MNIST and TFD to many state-of-the-art methods including DBN [35], Stacked CAE [84], Deep GSN [85], Generative Adversarial Networks [11] and GMMN + AE [50]. Note that the Parzen window estimate is a lower bound on the true log-likelihood and the tightness of this bound depends on the number of samples drawn. To obtain a comparison with a tighter lower bound, we additionally report Parzen window estimates evaluated with 10 million samples for both the adversarial autoencoder and the generative adversarial network [11]. In all comparisons we find that the adversarial autoencoder achieves superior log-likelihoods to competing methods. However, the reader should be aware that the metrics currently available for evaluating the likelihood of generative models such as GANs are deeply flawed. Theis et al. [86] detail the problems with such metrics, including the 10K and 10M sample Parzen window estimates.
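For reference, the Parzen window evaluation described above can be sketched as follows. This is a minimal NumPy/SciPy illustration written for this summary, not the original evaluation script; the function and variable names (parzen_log_likelihood, select_sigma, x_valid, x_test) are made up for the example.

```python
import numpy as np
from scipy.special import logsumexp

def parzen_log_likelihood(samples, x, sigma):
    """Per-example log-likelihood of test points x under a Gaussian Parzen window
    (isotropic kernels of width sigma) fit to model samples."""
    n, d = samples.shape
    lls = []
    # evaluate in chunks so the (chunk, n_samples, d) tensor stays manageable
    for chunk in np.array_split(x, max(1, len(x) // 100)):
        diff = (chunk[:, None, :] - samples[None, :, :]) / sigma
        log_kernel = -0.5 * np.sum(diff ** 2, axis=2)               # (chunk, n)
        log_norm = np.log(n) + 0.5 * d * np.log(2 * np.pi * sigma ** 2)
        lls.append(logsumexp(log_kernel, axis=1) - log_norm)
    return np.concatenate(lls)

def select_sigma(samples, x_valid, sigmas):
    """Pick the kernel width that maximizes the validation log-likelihood."""
    return max(sigmas, key=lambda s: parzen_log_likelihood(samples, x_valid, s).mean())

# usage: samples = 10,000 images drawn from the trained model, flattened to vectors
# sigma = select_sigma(samples, x_valid, np.logspace(-1, 0, 10))
# test_ll = parzen_log_likelihood(samples, x_test, sigma).mean()
```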

4.4 Supervised Adversarial Autoencoders

Semi-supervised learning is a long-standing conceptual problem in machine learning. Recently, generative models have become one of the most popular approaches for semi-supervised learning as they can disentangle the class label information from many other latent factors of variation in a principled way [21, 87]. In this section, we first focus on the fully supervised scenario and discuss an architecture

Figure 4.6: Disentangling the label information from the hidden code by providing the one-hot vector to the generative model. The hidden code in this case learns to represent the style of the image.

of adversarial autoencoders that can separate the class label information from the image style information. We then extend this architecture to the semi-supervised setting in Section 4.5. In order to incorporate the label information, we alter the network architecture of Figure 4.1 to provide a one-hot vector encoding of the label to the decoder (Figure 4.6). The decoder utilizes both the one-hot vector identifying the label and the hidden code z

Figure 4.7: Disentangling content and style (15-D Gaussian) on the MNIST and SVHN datasets. (a) MNIST. (b) SVHN.

to reconstruct the image. This architecture forces the network to retain, in the hidden code z, all the information that is independent of the label. Figure 4.7a demonstrates the results of such a network trained on MNIST digits in which the hidden code is forced into a 15-D Gaussian. Each row of Figure 4.7a presents reconstructed images in which the hidden code z is fixed to a particular value but the label is systematically explored. Note that the style of the reconstructed images is consistent across a given row. Figure 4.7b demonstrates the same experiment applied to the Street View House Numbers dataset [75]. A video showing the learnt SVHN style manifold can be found at this link. In this experiment, the one-hot vector represents the label associated with the central digit in the image. Note that the style information in each row still contains information about the labels of the left-most and right-most digits, because those digits are not provided as label information in the one-hot encoding.

4.5 Semi-Supervised Adversarial Autoencoders

Building on the foundations from Section 4.4, we now use the adversarial autoencoder to develop models for semi-supervised learning that exploit the generative description of the unlabeled data to improve the classification performance that would be obtained by using only the labeled data. Specifically, we assume the data is generated by a latent class variable y that comes from a Categorical distribution as well as a continuous latent variable z that comes from a Gaussian distribution:

p(y) = Cat(y),    p(z) = N(z | 0, I)    (4.4)

We alter the network architecture of Figure 4.6 so that the inference network of the AAE predicts both the discrete class variable y and the continuous latent variable z using the encoder q(z, y|x) (Figure 4.8). The decoder then utilizes both the class label as a one-hot vector and the continuous hidden code z to reconstruct the image. There are two separate adversarial networks that regularize the hidden representation of the autoencoder. The first adversarial network imposes a Categorical distribution on the label representation. This adversarial network ensures that the latent class variable y does not carry any style information and that the aggregated posterior distribution of y matches the Categorical distribution. The second adversarial network imposes a Gaussian distribution on the style representation which ensures the latent variable z is a continuous Gaussian variable. Both of the adversarial networks as well as the autoencoder are trained jointly with

Figure 4.8: Semi-Supervised AAE: the top adversarial network imposes a Categorical distribution on the label representation and the bottom adversarial network imposes a Gaussian distribution on the style representation. q(y|x) is trained on the labeled data in the semi-supervised setting.

SGD in three phases: the reconstruction phase, the regularization phase and the semi-supervised classification phase. In the reconstruction phase, the autoencoder updates the encoder q(z, y|x) and the decoder to minimize the reconstruction error of the inputs on an unlabeled mini-batch. In the regularization phase, each of the adversarial networks first updates its discriminative network to tell apart the true samples (generated using the Categorical and Gaussian priors) from the generated samples (the hidden codes computed by the autoencoder). The adversarial networks then update their generators to confuse their discriminative networks. In the semi-supervised classification phase, the autoencoder updates q(y|x) to minimize the cross-entropy cost on a labeled mini-batch. The results of semi-supervised classification experiments on the MNIST and SVHN datasets are reported in Table 4.2. On the MNIST dataset with 100 and 1000 labels, the performance of AAEs is significantly better than VAEs, on par with VAT [88] and CatGAN [89], but is outperformed by the Ladder networks [90] and the ADGM [87]. We also trained a supervised AAE model on all the available labels, and obtained an error rate of 0.85%. In comparison, a dropout supervised neural network with the same architecture achieves an error rate of 1.25% on the full MNIST dataset, which highlights the regularization effect of the adversarial training. On the SVHN dataset with 1000 labels, the AAE almost matches the state-of-the-art classification performance achieved by the ADGM.

                           MNIST           MNIST            MNIST           SVHN
                           (100 labels)    (1000 labels)    (All labels)    (1000 labels)
NN Baseline                25.80           8.73             1.25            47.50
VAE (M1) + TSVM            11.82 (±0.25)   4.24 (±0.07)     -               55.33 (±0.11)
VAE (M2)                   11.97 (±1.71)   3.60 (±0.56)     -               -
VAE (M1 + M2)              3.33 (±0.14)    2.40 (±0.02)     0.96            36.02 (±0.10)
VAT                        2.33            1.36             0.64 (±0.04)    24.63
CatGAN                     1.91 (±0.1)     1.73 (±0.18)     0.91            -
Ladder Networks            1.06 (±0.37)    0.84 (±0.08)     0.57 (±0.02)    -
ADGM                       0.96 (±0.02)    -                -               16.61 (±0.24)
Adversarial Autoencoders   1.90 (±0.10)    1.60 (±0.08)     0.85 (±0.02)    17.70 (±0.30)

Table 4.2: Semi-supervised classification performance (error-rate) on MNIST and SVHN.

It is also worth mentioning that all the AAE models are trained end-to-end, whereas the semi-supervised VAE models have to be trained one layer at a time [21].
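To make the three training phases of Section 4.5 concrete, they can be organized as in the following sketch. This is a hypothetical PyTorch re-implementation written only to illustrate the phase structure (the thesis experiments use TensorFlow with the hyperparameters listed in Section 4.9.2); the layer sizes, optimizer settings and helper names are illustrative, not the original code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_classes, z_dim, x_dim = 10, 10, 784

encoder_trunk = nn.Sequential(nn.Linear(x_dim, 1000), nn.ReLU(),
                              nn.Linear(1000, 1000), nn.ReLU())
head_y = nn.Linear(1000, n_classes)                 # q(y|x): softmax label code
head_z = nn.Linear(1000, z_dim)                     # q(z|x): linear style code
decoder = nn.Sequential(nn.Linear(n_classes + z_dim, 1000), nn.ReLU(),
                        nn.Linear(1000, 1000), nn.ReLU(),
                        nn.Linear(1000, x_dim), nn.Sigmoid())
disc_y = nn.Sequential(nn.Linear(n_classes, 1000), nn.ReLU(), nn.Linear(1000, 1), nn.Sigmoid())
disc_z = nn.Sequential(nn.Linear(z_dim, 1000), nn.ReLU(), nn.Linear(1000, 1), nn.Sigmoid())

enc_params = (list(encoder_trunk.parameters()) + list(head_y.parameters())
              + list(head_z.parameters()))
opt_ae   = torch.optim.SGD(enc_params + list(decoder.parameters()), lr=0.01, momentum=0.9)
opt_gen  = torch.optim.SGD(enc_params, lr=0.1, momentum=0.1)
opt_disc = torch.optim.SGD(list(disc_y.parameters()) + list(disc_z.parameters()),
                           lr=0.1, momentum=0.1)
opt_semi = torch.optim.SGD(enc_params, lr=0.1, momentum=0.9)
bce = nn.BCELoss()

def encode(x):
    h = encoder_trunk(x)
    return F.softmax(head_y(h), dim=1), head_z(h)

def train_step(x_unlab, x_lab, labels):
    batch = x_unlab.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # (1) Reconstruction phase: update encoder and decoder on an unlabeled mini-batch.
    y, z = encode(x_unlab)
    recon = decoder(torch.cat([y, z], dim=1))
    loss_rec = 0.5 * ((recon - x_unlab) ** 2).sum(dim=1).mean()   # half the Euclidean error
    opt_ae.zero_grad(); loss_rec.backward(); opt_ae.step()

    # (2) Regularization phase: first the discriminators, then the encoder (generator).
    y, z = encode(x_unlab)
    y_prior = F.one_hot(torch.randint(n_classes, (batch,)), n_classes).float()
    z_prior = torch.randn_like(z)
    loss_disc = (bce(disc_y(y_prior), ones) + bce(disc_y(y.detach()), zeros)
                 + bce(disc_z(z_prior), ones) + bce(disc_z(z.detach()), zeros))
    opt_disc.zero_grad(); loss_disc.backward(); opt_disc.step()

    y, z = encode(x_unlab)
    loss_gen = bce(disc_y(y), ones) + bce(disc_z(z), ones)        # fool both discriminators
    opt_gen.zero_grad(); loss_gen.backward(); opt_gen.step()

    # (3) Semi-supervised classification phase: cross-entropy on a labeled mini-batch.
    loss_cls = F.cross_entropy(head_y(encoder_trunk(x_lab)), labels)
    opt_semi.zero_grad(); loss_cls.backward(); opt_semi.step()
```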

4.6 Clustering with Adversarial Autoencoders

In the previous section, we showed that with limited label information, the adversarial autoencoder is able to learn powerful semi-supervised representations. However, the question that has remained unanswered is whether equally “powerful” representations can be learnt from unlabeled data without any supervision. In this section, we show that the adversarial autoencoder can disentangle discrete class variables from the continuous latent style variables in a purely unsupervised fashion. The architecture that we use is similar to Figure 4.8, with the difference that we remove the semi-supervised classification stage and thus no longer train the network on

Figure 4.9: Unsupervised clustering of MNIST using the AAE with 16 clusters. Each row corresponds to one cluster, with the first image being the cluster head (see text).

                                         MNIST (Unsupervised)
CatGAN [89] (20 clusters)                9.70
Adversarial Autoencoder (16 clusters)    9.55 (±2.05)
Adversarial Autoencoder (30 clusters)    4.10 (±1.13)

Table 4.3: Unsupervised clustering performance (error-rate) of the AAE on MNIST.

any labeled mini-batch. Another difference is that the inference network q(y|x) predicts a one-hot vector whose dimension is the number of categories that we wish the data to be clustered into. Figure 4.9 illustrates the unsupervised clustering performance of the AAE on MNIST when the number of clusters is 16. Each row corresponds to one cluster. The first image in each row shows the cluster head, which is a digit generated by fixing the style variable to zero and setting the label variable to one of the 16 one-hot vectors. The rest of the images in each row are random test images that have been categorized into the corresponding cluster based on q(y|x). We can see that the AAE has picked up some discrete styles as the class labels. For example, the tilted digit 1s and 6s (clusters 16 and 11) are put in separate clusters from the straight 1s and 6s (clusters 15 and 10), and the network has separated digit 2s into two clusters (clusters 4 and 6) depending on whether the digit is written with a loop. We performed an experiment to evaluate the unsupervised clustering performance of AAEs, using the following evaluation protocol: once the training is done, for each cluster i, we found the validation example x_n that maximizes q(y_i|x_n), and assigned the label of x_n to all the points in cluster i. We then computed the test error based on the class labels assigned to each cluster. As shown in Table 4.3, the AAE achieves classification error rates of 9.55% and 4.10% with 16 and 30 clusters, respectively. We observed that as the number of clusters grows, the classification performance improves.
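A minimal NumPy sketch of this evaluation protocol is shown below; the array names (q_y_valid, q_y_test, etc.) are illustrative and are assumed to hold the encoder's softmax outputs and integer ground-truth labels.

```python
import numpy as np

def cluster_error_rate(q_y_valid, y_valid, q_y_test, y_test):
    """q_y_*: (n, n_clusters) outputs of q(y|x); y_*: integer ground-truth labels."""
    # each cluster i inherits the label of the validation example that maximizes q(y_i|x_n)
    cluster_labels = y_valid[np.argmax(q_y_valid, axis=0)]       # shape (n_clusters,)
    # each test point is assigned to its most probable cluster, then to that cluster's label
    predictions = cluster_labels[np.argmax(q_y_test, axis=1)]
    return float(np.mean(predictions != y_test))
```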

4.7 Dimensionality Reduction with Adversarial Autoencoders

Visualization of high dimensional data is a very important problem in many applications as it facilitates the understanding of the generative process of the data and allows us to extract useful information about the data. A popular approach to data visualization is learning a low dimensional embedding in which nearby points correspond to similar objects. Over the last decade, a large number of new non-parametric dimensionality reduction techniques such as t-SNE [91] have been proposed. The main drawback of

Figure 4.10: Dimensionality reduction with adversarial autoencoders: there are two separate adversarial networks that impose Categorical and Gaussian distributions on the latent representation. The final n dimensional representation is constructed by first mapping the one-hot label representation to an n dimensional cluster head representation and then adding the result to an n dimensional style representation. The cluster heads are learned by SGD with an additional cost function that penalizes the Euclidean distance between every two of them.

these methods is that they do not have a parametric encoder that can be used to find the embedding of new data points. Different methods such as parametric t-SNE [92] have been proposed to address this issue. Autoencoders are interesting alternatives as they provide the non-linear mapping required for such embeddings; but it is widely known that non-regularized autoencoders “fracture” the manifold into many different domains, which results in very different codes for similar images [93]. In this section, we present an adversarial autoencoder architecture for dimensionality reduction and data visualization purposes. We will show that in these autoencoders, the adversarial regularization attaches the hidden codes of similar images to each other and thus prevents the manifold fracturing problem that is typically encountered in the embeddings learnt by autoencoders. Suppose we have a dataset with m class labels and we would like to reduce the dimensionality of the dataset to n, where n is typically 2 or 3 for visualization purposes. We alter the architecture of Figure 4.8 to that of Figure 4.10, in which the final representation is constructed by adding the n dimensional distributed representation of the cluster head to the n dimensional style representation. The cluster head representation

is obtained by multiplying the m dimensional one-hot class label vector by an m × n matrix W_C, where the rows of W_C represent the m cluster head representations that are learned with SGD. We introduce an additional cost function that penalizes the Euclidean distance between every two cluster heads. Specifically, if the Euclidean distance is larger than a threshold η, the cost function is zero, and if it is smaller than η, the cost function linearly penalizes the distance.
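One plausible instantiation of this cluster-head cost is the hinge-style penalty sketched below (PyTorch). The thesis only specifies the cost up to "linearly penalizes the distance", so the exact functional form and the threshold value here are assumptions made for illustration.

```python
import torch

def cluster_head_cost(W_C, eta=1.0):
    """W_C: (m, n) matrix whose rows are the m cluster heads learned with SGD."""
    dists = torch.cdist(W_C, W_C)                        # pairwise Euclidean distances
    m = W_C.shape[0]
    off_diag = ~torch.eye(m, dtype=torch.bool)           # ignore the zero self-distances
    # zero cost for pairs farther apart than eta, linear penalty for closer pairs
    return torch.clamp(eta - dists[off_diag], min=0).sum()
```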

Figure 4.11a,b show the results of the semi-supervised dimensionality reduction in n = 2 dimensions on the MNIST dataset (m = 10) with 1000 and 100 labels. We can see that the network achieves a clean separation of the digit clusters and obtains semi-supervised classification errors of 4.20% and 6.08%, respectively. Note that because of the 2D constraint, the classification error is not as good as in the high-dimensional cases, and the style distribution of each cluster is not quite Gaussian.

Figure 4.11c shows the result of unsupervised dimensionality reduction in n = 2 dimensions where the number of clusters is chosen to be m = 20. We can see that the network achieves a rather clean separation of the digit clusters and sub-clusters. For example, the network has assigned two different clusters to digit 1 (green clusters) depending on whether the digit is straight or tilted. The network also clusters digit 6 into three clusters (black clusters) depending on how tilted the digit is, and has assigned two separate clusters to digit 2 (red clusters), depending on whether the digit is written with a loop.

This AAE architecture (Figure 4.10) can also be used to embed images into larger dimensionalities (n > 2). For example, Figure 4.11d shows the result of semi-supervised dimensionality reduction in n = 10 dimensions with 100 labels. In this case, we fixed the W_C matrix to W_C = 10I, and thus the cluster heads are the corners of a 10 dimensional simplex. The style representation is learnt to be a 10D Gaussian distribution with a standard deviation of 1 and is directly added to the cluster head to construct the final representation. Once the network is trained, in order to visualize the 10D learnt representation, we use a linear transformation to map the 10D representation to a 2D space such that the cluster heads are mapped to points that are uniformly placed on a 2D circle. We can verify from this figure that, in this high-dimensional case, the style representation has indeed learnt to have a Gaussian distribution. With 100 total labels, this model achieves a classification error rate of 3.90%, which is worse than the error rate of 1.90% achieved by the AAE architecture with the concatenated style and label representation (Figure 4.8).
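The visualization step just described can be sketched as follows: a least-squares linear map sends the fixed cluster heads W_C = 10I to points uniformly spaced on a circle, and the same map is then applied to every 10-D representation. This is a NumPy sketch with illustrative names, not the original plotting code.

```python
import numpy as np

m = 10
W_C = 10.0 * np.eye(m)                                         # fixed cluster heads (10-D)
angles = 2 * np.pi * np.arange(m) / m
targets = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # points on a 2-D circle
A, *_ = np.linalg.lstsq(W_C, targets, rcond=None)              # linear map with W_C @ A = targets

def to_2d(representation_10d):
    """Project (batch, 10) representations (cluster head + style) to 2-D for plotting."""
    return representation_10d @ A
```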

Figure 4.11: Semi-Supervised and Unsupervised Dimensionality Reduction with AAEs on MNIST. (a) 2D representation with 1000 labels (4.20% error). (b) 2D representation with 100 labels (6.08% error). (c) 2D representation learnt in an unsupervised fashion with 20 clusters (13.95% error). (d) 10D representation with 100 labels projected onto the 2D space (3.90% error).

4.8 Conclusion

In this chapter, we proposed to use the GAN framework as a variational inference algorithm for both discrete and continuous latent variables in probabilistic autoencoders. Our method, called the adversarial autoencoder (AAE), is a generative autoencoder that achieves competitive test likelihoods on the real-valued MNIST and Toronto Face datasets.

We discussed how this method can be extended to semi-supervised scenarios and showed that it achieves competitive semi-supervised classification performance on the MNIST and SVHN datasets. Finally, we demonstrated the applications of adversarial autoencoders in disentangling the style and content of images, unsupervised clustering, dimensionality reduction and data visualization.

4.9 Appendix

4.9.1 Likelihood Experiments

The encoder, decoder and discriminator each have two layers of 1000 hidden units with the ReLU activation function. The activation of the last layer of q(z|x) is linear. The weights are initialized from a Gaussian distribution with a standard deviation of 0.01. The mini-batch size is 100. The autoencoder is trained with a Euclidean cost function for reconstruction. On the MNIST dataset we use the sigmoid activation function in the last layer of the autoencoder, and on the TFD dataset we use the linear activation function. The dimensionality of the hidden code z is 8 and 15 and the standard deviation of the Gaussian prior is 5 and 10 for MNIST and TFD, respectively. On the Toronto Face dataset, the mean is subtracted from each data point and the result is divided by the standard deviation along each input dimension across the whole training set to normalize the contrast. However, after obtaining the samples, we rescaled the images (by inverting the pre-processing stage) to have pixel intensities between 0 and 1 so that we can have a fair likelihood comparison with other methods. In the deterministic case of q(z|x), the dimensionality of the hidden code should be consistent with the intrinsic dimensionality of the data, since the only source of stochasticity in q(z) is the data distribution. For example, in the case of MNIST, the dimensionality of the hidden code can be between 5 and 8, and for TFD and SVHN, it can be between 10 and 20. For training AAEs with higher dimensionalities in the code space (e.g., 1000), the probabilistic q(z|x) along with the reparametrization trick can be used.

4.9.2 Semi-Supervised Experiments

MNIST

The encoder, decoder and discriminator each have two layers of 1000 hidden units with the ReLU activation function. The last layer of the autoencoder can have a linear or sigmoid activation (sigmoid is better for sample visualization). The cost function is half the Euclidean error. The last layer of q(y|x) and q(z|x) has the softmax and linear activation function, respectively. q(y|x) and q(z|x) share the first two 1000-unit layers of the encoder. The dimensionality of both the style and label representation is 10. On the style representation, we impose a Gaussian distribution with a standard deviation of 1. On the label representation, we impose a Categorical distribution. The semi-supervised cost is a cross-entropy cost function at the output of q(y|x). We use gradient descent with momentum for optimizing all the cost functions. The momentum value for the autoencoder

reconstruction cost and the semi-supervised cost is fixed to 0.9. The momentum value for the generator and discriminator of both of the adversarial networks is fixed to 0.1. For the reconstruction cost, we use an initial learning rate of 0.01, reduce it to 0.001 after 50 epochs and to 0.0001 after 1000 epochs. For the semi-supervised cost, we use an initial learning rate of 0.1, reduce it to 0.01 after 50 epochs and to 0.001 after 1000 epochs. For both the discriminative and generative costs of the adversarial networks, we use an initial learning rate of 0.1, reduce it to 0.01 after 50 epochs and to 0.001 after 1000 epochs. We train the network for 5000 epochs. We add Gaussian noise with a standard deviation of 0.3 only to the input layer and only at training time. No dropout, ℓ2 weight decay or other Gaussian noise regularization was used in any other layer. The labeled examples were chosen at random, but we made sure they were distributed evenly across the classes. In the case of MNIST with 100 labels, the test error after the first epoch is 16.50%, after 50 epochs is 3.40%, after 500 epochs is 2.21% and after 5000 epochs is 1.80%. Batch-normalization [94] did not help in the case of the MNIST dataset.

SVHN

The SVHN dataset has about 530K training points and 26K test points. The mean is subtracted from each data point and the result is divided by the standard deviation along each input dimension across the whole training set to normalize the contrast. The dimensionality of the label representation is 10 and for the style representation we use 20 dimensions. We use gradient descent with momentum for optimizing all the cost functions. The momentum value for the autoencoder reconstruction cost and the semi-supervised cost is fixed to 0.9. The momentum value for the generator and discriminator of both of the adversarial networks is fixed to 0.1. For the reconstruction cost, we use an initial learning rate of 0.0001 and reduce it to 0.00001 after 250 epochs. For the semi-supervised cost, we use an initial learning rate of 0.1 and reduce it to 0.01 after 250 epochs. For both the discriminative and generative costs of the adversarial networks, we use an initial learning rate of 0.01 and reduce it to 0.001 after 250 epochs. We train the network for 1000 epochs. We use dropout at the input layer with a dropout rate of 20%. No other dropout, ℓ2 weight decay or Gaussian noise regularization was used in any other layer. The labeled examples were chosen at random, but we made sure they were distributed evenly across the classes. In the case of SVHN with 1000 labels, the test error after the first epoch is 49.34%, after 50 epochs is 25.86%, after 500 epochs is 18.15% and after 1000 epochs is 17.66%. Batch-normalization was used in all the autoencoder layers, including the softmax layer of q(y|x), the linear layer of q(z|x) as well as the linear output layer of the autoencoder.

We found batch-normalization [94] to be crucial in training the AAE network on the SVHN dataset.

4.9.3 Unsupervised Clustering Experiments

The encoder, decoder and discriminator each have two layers of 3000 hidden units with the ReLU activation function. The last layer of the autoencoder has a sigmoid activation function. The cost function is half the Euclidean error. The dimensionality of the style and label representation is 5 and 30 (the number of clusters), respectively. On the style representation, we impose a Gaussian distribution with a standard deviation of 1. On the label representation, we impose a Categorical distribution. We use gradient descent with momentum for optimizing all the cost functions. The momentum value for the autoencoder reconstruction cost is fixed to 0.9. The momentum value for the generator and discriminator of both of the adversarial networks is fixed to 0.1. For the reconstruction cost, we use an initial learning rate of 0.01 and reduce it to 0.001 after 50 epochs. For both the discriminative and generative costs of the adversarial networks, we use an initial learning rate of 0.1 and reduce it to 0.01 after 50 epochs. We train the network for 1500 epochs. We use dropout at the input layer with a dropout rate of 20%. No other dropout, ℓ2 weight decay or Gaussian noise regularization was used in any other layer. Batch-normalization was used only in the encoder layers of the autoencoder, including the last layer of q(y|x) and q(z|x). We found batch-normalization [94] to be crucial in training the AAE networks for unsupervised clustering.

Chapter 5

PixelGAN Autoencoders

5.1 Introduction

In recent years, generative models that can be trained via direct back-propagation have enabled remarkable progress in modeling natural images. In Chapter 1, we reviewed some of these models such as generative adversarial network (GAN) [11], variational autoencoders (VAE) [10, 37], and PixelCNNs [56]. In this chapter, we present the PixelGAN autoencoder as a probabilistic autoencoder that combines the benefits of latent variable models with autoregressive architectures. The PixelGAN autoencoder is a generative autoencoder in which the generative path is a PixelCNN that is conditioned on a latent variable. The latent variable is inferred by matching the aggregated posterior distribution to the prior distribution by an adversarial training technique similar to that of the adversarial autoencoder (Chapter 4). However, whereas in adversarial autoencoders the statistics of the data distribution are captured by the latent code, in the PixelGAN autoencoder they are captured jointly by the latent code and the autoregressive decoder. We show that imposing different distributions as the prior results in different factorizations of information between the latent code and the autoregressive decoder. For example, in Section 5.2.1, we show that by imposing a Gaussian distribution on the latent code, we can achieve a global vs. local decomposition of information. In this case, the global latent code no longer has to model all the irrelevant and fine details of the image, and can use its capacity to capture more relevant and global statistics of the image. Another type of decomposition of information that can be learnt by PixelGAN autoencoders is a discrete vs. continuous decomposition. In Section 5.2.2, we show that we can achieve this decomposition by imposing a categorical prior on the latent code using adversarial training. In this case, the categorical latent code captures the discrete underlying factors of variation in the data, such as class label information,


Figure 5.1: Architecture of the PixelGAN autoencoder.

and the autoregressive decoder captures the remaining continuous structure, such as style information, in an unsupervised fashion. We then show how PixelGAN autoencoders with categorical priors can be directly used in clustering and semi-supervised scenarios and achieve very competitive classification results on several datasets in Section 5.3. Finally, we present one of the main potential applications of PixelGAN autoencoders in learning cross-domain relations between two different domains in Section 5.4.

5.2 PixelGAN Autoencoders

Let x be a datapoint that comes from the distribution p_d(x) and z be the hidden code. The recognition path of the PixelGAN autoencoder (Figure 5.1) defines an implicit posterior distribution q(z|x) by using a deterministic neural function z = f(x, n) that takes the input x along with random noise n with a fixed distribution p(n) and outputs z. The aggregated posterior q(z) of this model is defined as follows:

q(z) = ∫ q(z|x) p_d(x) dx.    (5.1)

This parametrization of the implicit posterior distribution was originally proposed in the adversarial autoencoder work [46] as the universal approximator posterior. We can sample from this implicit distribution q(z|x) by evaluating f(x, n) at different samples of n, but the density function of this posterior distribution is intractable. Appendix 5.6.1 discusses the importance of the input noise in training PixelGAN autoencoders. The generative path p(x|z) is a conditional PixelCNN [56] that conditions on the latent vector z using an adaptive bias in the PixelCNN layers. The inference is done by an amortized GAN

inference technique that was originally proposed in the adversarial autoencoder work [46]. In this method, an adversarial network is attached on top of the hidden code vector of the autoencoder and matches the aggregated posterior distribution, q(z), to an arbitrary prior, p(z). Samples from q(z) and p(z) are provided to the adversarial network as the negative and positive examples respectively, and the generator of the adversarial network, which is also the encoder of the autoencoder, tries to match q(z) to p(z) by the gradient that comes through the discriminative adversarial network. The adversarial network, the PixelCNN decoder and the encoder are trained jointly in two phases – the reconstruction phase and the adversarial phase – executed on each mini-batch. In the reconstruction phase, the ground truth input x along with the hidden code z inferred by the encoder are provided to the PixelCNN decoder. The PixelCNN decoder weights are updated to maximize the log-likelihood of the input x. The encoder weights are also updated at this stage by the gradient that comes through the conditioning vector of the PixelCNN. In the adversarial phase, the adversarial network updates both its discriminative network and its generative network (the encoder) to match q(z) to p(z). Once the training is done, we can sample from the model by first sampling z from the prior distribution p(z), and then sampling from the conditional likelihood p(x|z) parametrized by the PixelCNN decoder. We now establish a connection between the PixelGAN autoencoder cost and maximum likelihood learning using a decomposition of the aggregated evidence lower bound (ELBO) proposed in [95]:

E_{x∼p_d(x)}[log p(x)] ≥ − E_{x∼p_d(x)}[ E_{q(z|x)}[− log p(x|z)] ] − E_{x∼p_d(x)}[ KL(q(z|x) ‖ p(z)) ]    (5.2)

                       = − E_{x∼p_d(x)}[ E_{q(z|x)}[− log p(x|z)] ] − KL(q(z) ‖ p(z)) − I(z; x)    (5.3)
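The step from Equation 5.2 to Equation 5.3 uses the standard decomposition of the average per-example KL term into a marginal KL term and a mutual information term, E_{x∼p_d(x)}[KL(q(z|x) ‖ p(z))] = KL(q(z) ‖ p(z)) + I(z; x), where I(z; x) denotes the mutual information between z and x under the joint distribution q(z|x) p_d(x) [95].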

The first term in Equation 5.3 is the reconstruction term and the second term is the marginal KL divergence between the aggregated posterior and the prior distribution. The third term is the mutual information between the latent code z and the input x. This is a regularization term that encourages z and x to be decoupled by removing the information of the data distribution from the hidden code. If the training set has N examples, I(z; x) is bounded as follows (see [95]).

0 ≤ I(z; x) ≤ log N    (5.4)

In order to maximize the ELBO, we need to minimize all three terms of Equation 5.3. We consider two cases for the decoder p(x|z):

Deterministic Decoder. If the decoder p(x|z) is deterministic or has very limited stochasticity such as the simple factorized decoder of the VAE, the mutual information term acts in the complete opposite direction of the reconstruction term. This is because the only way to minimize the reconstruction error of x is to learn a hidden code z that is relevant to x, which results in maximizing I(z; x). Indeed, it can be shown that minimizing the reconstruction term maximizes a variational lower bound on I(z; x) [96, 97]. For example, in the case of the VAE trained on MNIST, since the reconstruction is precise, the mutual information term is dominated and is close to its maximum value I(z; x) ≈ log N ≈ 11.00 nats [95]. Stochastic Decoder. If we use a powerful decoder such as the PixelCNN, the reconstruction term and the mutual information term will not compete with each other anymore and the network can minimize both independently. In this case, the optimal

solution for maximizing the ELBO would be to model p_d(x) solely by p(x|z), thereby minimizing the reconstruction term, and at the same time to minimize the mutual information term by ignoring the latent code. As a result, even though the model achieves a high likelihood, the latent code does not learn any useful representation, which is undesirable. This problem has been observed in several previous works [98, 99] and different techniques such as annealing the weight of the KL term [98] or weakening the decoder [99] have been proposed to make z and x more dependent. As suggested in [100, 99], we think that the maximum likelihood objective by itself is not a useful objective for representation learning, especially when a powerful decoder is used. In PixelGAN autoencoders, in order to encourage learning more useful representations, we modify the ELBO (Equation 5.3) by removing the mutual information term from it, since this term explicitly encourages z to become independent of x. So our cost function only includes the reconstruction term and the marginal KL term. The reconstruction term is optimized by the reconstruction phase of training and the marginal KL term is approximately optimized by the adversarial phase.¹ Note that since the mutual information term is upper bounded by a constant (log N), we are still maximizing a lower bound on the log-likelihood of the data. However, this bound is weaker than the ELBO, which is the price that is paid for learning more useful latent representations by balancing the decomposition of information between the latent code and the autoregressive decoder. For implementing the conditioning adaptive bias in the PixelCNN decoder, we explore two different architectures [56]. In the location-invariant bias, for each PixelCNN layer, we use the latent code to construct a vector that is broadcasted within each feature map

¹ The original GAN formulation optimizes the Jensen-Shannon divergence [11], but there are other formulations that optimize the KL divergence, e.g., [43].

Figure 5.2: (a) Samples of the PixelGAN autoencoder with a 2-D Gaussian code and a limited receptive field of size 9. (b) Samples of a PixelCNN with the same limited receptive field. (c) Samples of an adversarial autoencoder with a 2-D code.

of the layer and then added as an adaptive bias to that layer. In the location-dependent bias, we use the latent code to construct a spatial feature map that is broadcasted across different feature maps and then added only to the first layer of the decoder as an adaptive bias. We will discuss the effect of these architectures on the learnt representation in Figure 5.3 of Section 5.2.1 and their implementation details in Appendix 5.6.1.

5.2.1 PixelGAN Autoencoders with Gaussian Priors

Here, we show that PixelGAN autoencoders with Gaussian priors can decompose the global and local statistics of the images between the latent code and the autoregressive decoder. Figure 5.2a shows samples of a PixelGAN autoencoder model with the location-dependent bias trained on the MNIST dataset. For the purpose of better illustrating the decomposition of information, we have chosen a 2-D Gaussian latent code and limited the receptive field of the PixelGAN autoencoder to a size of 9. Figure 5.2b shows samples of a PixelCNN model with the same limited receptive field size of 9, and Figure 5.2c shows samples of an adversarial autoencoder with the 2-D Gaussian latent code. The PixelCNN can successfully capture the local statistics, but fails to capture the global statistics due to the limited receptive field size. In contrast, the adversarial autoencoder, whose sample quality is very similar to that of the VAE, can successfully capture the global statistics, but fails to generate the details of the images. However, the PixelGAN autoencoder, with the same receptive field and code size, can combine the best of both and generates sharp images with the correct global statistics. In PixelGAN autoencoders, both the PixelCNN depth and the conditioning architecture

Figure 5.3: The effect of the PixelCNN decoder depth and the conditioning architecture on the learnt representation of the PixelGAN autoencoder (shallow = 3 residual blocks, deep = 12 residual blocks). (a) Shallow PixelCNN, location-invariant bias. (b) Shallow PixelCNN, location-dependent bias. (c) Deep PixelCNN, location-invariant bias. (d) Deep PixelCNN, location-dependent bias.

affect the decomposition of information between the latent code and the autoregressive decoder. We investigate these effects in Figure 5.3 by training a PixelGAN autoencoder on MNIST where the code size is chosen to be 2 for visualization purposes. As shown in Figure 5.3a,b, when a shallow decoder is used, most of the information is encoded in the hidden code and there is a clean separation between the digit clusters. As we make the PixelCNN more powerful (Figure 5.3c,d), we can see that the hidden code is still used to capture some relevant information of the input, but the separation of digit clusters is not as sharp when the limited code size of 2 is used. In the next section, we will show that by using a larger code size (e.g., 30), we can get a much better separation of digit clusters even when a powerful PixelCNN is used. The conditioning architecture also affects the decomposition of information. In the case of the location-invariant bias, the hidden code is encouraged to learn the global information that is location-invariant (the what information and not the where information) such as the class label information. For example, we can see in Figure 5.3a,c that the network has learnt to use one of the axes of the 2D Gaussian code to explicitly encode the digit label even though a continuous prior is imposed. In this case, we can potentially get a much better separation if we impose a discrete prior. This makes this architecture suitable for the discrete vs. continuous decomposition, and we use it for our clustering and semi-supervised learning experiments. In the case of the location-dependent bias (Figure 5.3b,d), the hidden code is encouraged to learn the global information that is location dependent, such as the low-frequency content of the image, similar to what the hidden code of an adversarial or variational autoencoder would learn (Figure 5.2c). This makes this architecture suitable for the global vs. local decomposition experiments such as Figure 5.2a.

From Figure 5.3, we can see that the class label information is mostly captured by p(z) while the style information of the images is captured by both p(z) and p(x|z). This decomposition of information has also been studied in other works that combine the latent variable models with autoregressive decoders such as PixelVAE [101] and variational lossy autoencoders (VLAE) [99]. For example, the VLAE model [99] proposes to use the depth of the PixelCNN decoder to control the decomposition of information. In their model, the PixelCNN decoder is designed to have a shallow depth (small local receptive field) so that the latent code z is forced to capture more global information. This approach is very similar to our example of the PixelGAN autoencoder in Figure 5.2. However, the question that has remained unanswered is whether it is possible to achieve a complete decomposition of content and style in an unsupervised fashion, where the class label or discrete structure information is encoded in the latent code z, and the remaining continuous structure such as style is captured by a powerful and deep PixelCNN decoder. This kind of decomposition is particularly interesting as it can be directly used for clustering and semi-supervised classification. In the next section, we show that we can learn this decomposition of content and style by imposing a categorical distribution on the latent representation z using adversarial training. Note that this discrete vs. continuous decomposition is very different from the global vs. local decomposition, because a continuous factor of variation such as style can have both global and local effect on the image. Indeed, in order to achieve the discrete vs. continuous decomposition, we have to use very deep and powerful PixelCNN decoders (up to 20 residual blocks) to capture both the global and local statistics of the style by the PixelCNN while the discrete content of the image is captured by the categorical latent variable.

5.2.2 PixelGAN Autoencoders with Categorical Priors

In this section, we present an architecture of the PixelGAN autoencoder that can separate the discrete information (e.g., class label) from the continuous information (e.g., style information) in the images. We then show how our architecture can be naturally adapted to the semi-supervised setting. The architecture that we use is similar to Figure 5.1, with the difference that we impose a categorical distribution as the prior rather than the Gaussian distribution (Figure 5.4), and we also use the location-invariant bias architecture. Another difference is that we use a convolutional network as the inference network q(z|x) to encourage the encoder to preserve the content and lose the style information of the image. The inference network has a softmax output and predicts a one-hot vector whose dimension is the number of

Figure 5.4: Architecture of the PixelGAN autoencoder with the categorical prior. p(z) captures the class label and p(x|z) is a multi-modal distribution that captures the style distribution of a digit conditioned on the class label of that digit.

discrete labels or categories that we wish the data to be clustered into. The adversarial network is trained directly on the continuous probability outputs of the softmax layer of the encoder. Imposing a categorical distribution at the output of the encoder imposes two constraints. The first constraint is that the encoder has to make confident decisions about the class labels of the inputs. The adversarial training pushes the output of the encoder to the corners of the softmax simplex, by which it ensures that the autoencoder cannot use the latent vector z to carry any continuous style information. The second constraint imposed by adversarial training is that the aggregated posterior distribution of z should match the categorical prior distribution with uniform outcome probabilities. This constraint enforces the encoder to evenly distribute the class labels across the corners of the softmax simplex. Because of these constraints, the latent variable will only capture the discrete content of the image and all the continuous style information will be captured by the autoregressive decoder. In order to better understand and visualize the effect of the adversarial training on shaping the hidden code distribution, we train a PixelGAN autoencoder on the first three digits of MNIST (18000 training and 3000 test points) and choose the number of

clusters to be 3. Suppose z = [z1, z2, z3] is the hidden code, which in this case is the vector of output probabilities of the softmax layer of the inference network. In Figure 5.5a, we project the 3D softmax simplex z1 + z2 + z3 = 1 onto a 2D triangle and plot the hidden codes of the training examples when no distribution is imposed on the hidden code. We can see from this figure that the network has learnt to use the surface of the softmax simplex to encode style information of the digits, and thus the three corners of the simplex do not

Figure 5.5: Effect of GAN regularization (categorical prior) on the code space of PixelGAN autoencoders. (a) Without GAN regularization. (b) With GAN regularization.

have any meaningful interpretation. Figure 5.5b corresponds to the code space of the same network when a categorical distribution is imposed using adversarial training. In this case, we can see that the network has successfully learnt to encode the label information of the three digits in the three corners of the simplex, and all the style information has been separately captured by the autoregressive decoder. This network achieves an almost perfect test error rate of 0.3% on the first three digits of MNIST, even though it is trained in a purely unsupervised fashion. Once the PixelGAN autoencoder is trained, its encoder can be used for clustering new points and its decoder can be used to generate samples from each cluster. Figure 5.6 illustrates samples of the PixelGAN autoencoder trained on the full MNIST dataset. The number of clusters is set to 30 and each row corresponds to the conditional samples of one of the clusters (only 16 are shown). We can see that the discrete latent code of the network has learnt discrete factors of variation such as class label information and some discrete style information. For example, digit 1s are put in different clusters based on how tilted they are. The network also assigns different clusters to digit 2s (based on whether they have a loop) and digit 7s (based on whether they have a dash in the middle). In Section 5.3.1, we will show that by using the encoder of this network, we can obtain an error rate of about 5% in classifying digits in an unsupervised fashion, just by matching each cluster to a digit type.

Semi-Supervised PixelGAN Autoencoders. The PixelGAN autoencoder can be used in a semi-supervised setting. In order to incorporate the label information, we add a semi-supervised training phase. Specifically, we set the number of clusters to be the same

Figure 5.6: Disentangling the content and style in an unsupervised fashion with PixelGAN autoencoders. Each row shows samples of the model from one of the learnt clusters.

as the number of class labels and after executing the reconstruction and the adversarial phases on an unlabeled mini-batch, the semi-supervised phase is executed on a labeled mini-batch, by updating the weights of the encoder q(z|x) to minimize the cross-entropy cost. The semi-supervised cost also reduces the mode-missing behavior of the GAN training by enforcing the encoder to learn all the modes of the categorical distribution. In Section 5.3.2, we will evaluate the performance of the PixelGAN autoencoders on the semi-supervised classification tasks.

5.3 Experiments

In this chapter, we presented the PixelGAN autoencoder as a generative model, but the currently available metrics for evaluating the likelihood of GAN-based generative models such as Parzen window estimate are fundamentally flawed [86]. So in this section, we only present the performance of the PixelGAN autoencoder on downstream tasks such as unsupervised clustering and semi-supervised classification. The details of all the experiments can be found in Appendix 5.6.2.

5.3.1 Unsupervised Clustering

We trained a PixelGAN autoencoder in an unsupervised fashion on the MNIST dataset (Figure 5.6). We chose the number of clusters to be 30 and used the following evaluation protocol: once the training is done, for each cluster i, we found the validation example x_n that maximizes q(z_i|x_n), and assigned the label of x_n to all the points in cluster i. We then computed the test error based on the class labels assigned to each cluster. As shown in the first column of Table 5.1, the performance of PixelGAN autoencoders is on

par with other GAN-based clustering algorithms such as CatGAN [89], InfoGAN [97] and adversarial autoencoders (Chapter 4).

5.3.2 Semi-supervised Classification

Table 5.1, Table 5.2 and Figure 5.8 report the results of semi-supervised classification experiments on the MNIST, SVHN and NORB datasets. On the MNIST dataset with 20, 50 and 100 labels, our classification results are highly competitive. Note that the classification rate of unsupervised clustering of MNIST is better than semi-supervised MNIST with 20 labels. This is because in the unsupervised case, the number of clusters is 30, but in the semi-supervised case, there are only 10 class labels which makes it more likely to confuse two digits. On the SVHN dataset with 500 and 1000 labels, the PixelGAN autoencoder outperforms all the other methods except the recently proposed temporal ensembling work [102] which is not a generative model. On the NORB dataset with 1000 labels, the PixelGAN autoencoder outperforms all the other reported results. Figure 5.7 shows the conditional samples of the semi-supervised PixelGAN autoencoder on the MNIST, SVHN and NORB datasets. Each column of this figure presents sampled images conditioned on a fixed one-hot latent code. We can see from this figure that the PixelGAN autoencoder can achieve a rather clean separation of style and content on these datasets with very few labeled data.

Figure 5.7: Conditional samples of the semi-supervised PixelGAN autoencoder. (a) SVHN (1000 labels). (b) MNIST (100 labels). (c) NORB (1000 labels).

Figure 5.8: Semi-supervised error rate of PixelGAN autoencoders on the MNIST and SVHN datasets. Left panel: error rate vs. training epochs on MNIST with 100, 50 and 20 labels, and for unsupervised clustering with 30 clusters. Right panel: error rate vs. training epochs on SVHN with 1000 and 500 labels.

                              MNIST            MNIST           MNIST           MNIST
                              (Unsupervised)   (20 labels)     (50 labels)     (100 labels)
VAE [21]                      -                -               -               3.33 (±0.14)
VAT [88]                      -                -               -               2.33
ADGM [87]                     -                -               -               0.96 (±0.02)
SDGM [87]                     -                -               -               1.32 (±0.07)
Adversarial Autoencoder [46]  4.10 (±1.13)     -               -               1.90 (±0.10)
Ladder Networks [90]          -                -               -               0.89 (±0.50)
Convolutional CatGAN [89]     4.27             -               -               1.39 (±0.28)
InfoGAN [97]                  5.00             -               -               -
Feature Matching GAN [103]    -                16.77 (±4.52)   2.21 (±1.36)    0.93 (±0.06)
PixelGAN Autoencoders         5.27 (±1.81)     12.08 (±5.50)   1.16 (±0.17)    1.08 (±0.15)

Table 5.1: Semi-supervised learning and clustering error-rate on the MNIST dataset.

                              SVHN            SVHN            NORB
                              (500 labels)    (1000 labels)   (1000 labels)
VAE [21]                      -               36.02 (±0.10)   18.79 (±0.05)
VAT [88]                      -               24.63           9.88
ADGM [87]                     -               22.86           10.06 (±0.05)
SDGM [87]                     -               16.61 (±0.24)   9.40 (±0.04)
Adversarial Autoencoder [46]  -               17.70 (±0.30)   -
Feature Matching GAN [103]    18.44 (±4.80)   8.11 (±1.30)    -
Temporal Ensembling [102]     7.05 (±0.30)    5.43 (±0.25)    -
PixelGAN Autoencoders         10.47 (±1.80)   6.96 (±0.55)    8.90 (±1.0)

Table 5.2: Semi-supervised learning error-rate on the SVHN and NORB datasets.

5.4 Learning Cross-Domain Relations with PixelGAN Autoencoders

In this section, we discuss how the PixelGAN autoencoder can be viewed in the context of learning cross-domain relations between two different domains. We also describe how the problem of clustering or semi-supervised learning can be cast as the problem of finding a smooth cross-domain mapping from the data distribution to the categorical distribution. Recently several GAN-based methods have been developed to learn a cross-domain mapping between two different domains [104, 105, 106, 46, 107]. In [106], an unsupervised cost function called the output distribution matching (ODM) is proposed to find a

cross-domain mapping F between two domains D1 and D2 by imposing the following unsupervised constraint on the uncorrelated samples from x ∼ D1 and y ∼ D2:

Distr[F(x)] = Distr[y]    (5.5)

where Distr[z] denotes the distribution of the random variable z. Adversarial training is proposed as one of the methods for matching these distributions. If we have access to a few labeled pairs (x, y), then F can be further trained on them in a supervised fashion to satisfy F(x) = y. For example, in speech recognition, we want to find a cross-domain mapping from a sequence of phonemes to a sequence of characters. By optimizing the ODM cost function in Equation 5.5, we can find a smooth function F that takes phonemes at its input and outputs a sequence of characters that respects the language model. However, the main problem with this method is that the network can learn to ignore part of the input distribution and still satisfy the ODM cost function with its output distribution. This problem has also been observed in other works such as [104]. One way to avoid this problem is to add a reconstruction term to the ODM cost function by introducing a reverse mapping from the output of the encoder to the input domain. This is essentially the idea of the adversarial autoencoder (Chapter 4), which learns a generative model by finding a cross-domain mapping between a Gaussian distribution and the data distribution. Using the ODM cost function along with a reconstruction term to learn cross-domain relations has been explored in several previous works. For example, InfoGAN [97] adds a mutual information term to the ODM cost function and optimizes a variational lower bound on this term. It can be shown that maximizing this variational bound is indeed minimizing the reconstruction cost of an autoencoder [96]. Similarly, in [107, 108], an adversarial autoencoder is used to learn the cross-domain relations of the vector representations of words from two different languages. The architecture of the

Figure 5.9: Cross-domain adaptation with (a) output distribution matching [106] and (b) adversarial autoencoders [46].

recent works of DiscoGAN [104] and CycleGAN [105] is also similar to an adversarial autoencoder in which the latent representation is enforced to have the distribution of the other domain. Here we describe how our proposed PixelGAN autoencoder can potentially be used in all these application areas to learn better cross-domain relations.

Suppose we want to learn a mapping from domain D1 to D2. In the architecture of Figure 5.1, we can use independent samples of x ∼ D1 at the input and instead of imposing a Gaussian distribution on the latent code, we can impose the distribution of the

second domain using its independent samples y ∼ D2. Unlike adversarial autoencoders, the encoder of PixelGAN autoencoders does not have to retain all the input information in order to have a lossless reconstruction. So the encoder can use all its capacity to learn

the most relevant mapping from D1 to D2 and at the same time, the PixelCNN decoder can capture the remaining information that has been lost by the encoder.

We can adopt the ODM idea for semi-supervised learning by assuming D1 is the image domain and D2 is the label domain (Figure 5.9a). Independent samples of D1 and D2

correspond to samples from the data distribution p_d(x) and the categorical distribution. The function F = q(y|x) can be parametrized by a neural network that is trained to satisfy the ODM cost function by matching the aggregated distribution q(y) = ∫ q(y|x) p_d(x) dx to the categorical distribution using adversarial training. The few labeled examples are used to further train F to satisfy F(x) = y. However, as explained above, the problem with this method is that the network can learn to generate the categorical distribution by ignoring some part of the input distribution. The adversarial autoencoder (Figure 5.9b)

solves this problem by adding an inverse mapping from the categorical distribution to the data distribution. However, the main drawback of the adversarial autoencoder architecture is that due to the reconstruction term, the latent representation now has to model all the underlying factors of variation in the image. For example, in the architecture of Figure 5.9b, while we are only interested in the one-hot label representation to do semi-supervised learning, we also need to infer the style of the image so that we can have a lossless reconstruction of the image. The PixelGAN autoencoder solves this problem by enabling the encoder to only infer the factor of variation that we are interested in (i.e., label information), while the remaining structure of the input (i.e., style information) is automatically captured by the autoregressive decoder.

5.5 Conclusion

In this chapter, we proposed the PixelGAN autoencoder, which is a generative autoencoder that combines a generative PixelCNN with a GAN inference network that can impose arbitrary priors on the latent code. We showed that imposing different distributions as the prior enables us to learn a latent representation that captures the type of statistics that we care about, while the remaining structure of the image is captured by the PixelCNN decoder. Specifically, by imposing a Gaussian prior, we were able to disentangle the low-frequency and high-frequency statistics of the images, and by imposing a categorical prior we were able to disentangle the style and content of images and learn representations that are specifically useful for clustering and semi-supervised learning tasks. While the main focus of this chapter was to demonstrate the application of PixelGAN autoencoders in downstream tasks such as semi-supervised learning, we discussed how these architectures have many other potential applications, such as learning cross-domain relations between two different domains.

5.6 Appendix

5.6.1 Implementation Details

In this section, we describe two important architecture design choices for training PixelGAN autoencoders.

Input noise

In all the semi-supervised experiments, we found it crucial to use the universal approximator posterior discussed in Section 5.2, as opposed to a deterministic posterior. Specifically, the input noise that we use is an additive Gaussian noise, which results in a posterior distribution q(z|x) that is more expressive than that of a model without the input corruption. This is similar to the denoising criterion proposed in [109]. We believe this additive noise also plays an important role in preventing the mode-missing behavior of the GAN when imposing a degenerate distribution such as the categorical distribution. Related ideas have been used to stabilize GAN training, such as instance noise [110] and one-sided label smoothing [103].
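A minimal sketch of this idea, with a placeholder linear encoder standing in for the real network: because the Gaussian noise is added at the input and then passed through the encoder, repeated encodings of the same x yield different codes, i.e., q(z|x) becomes a stochastic, implicit posterior rather than a point mass.

```python
import numpy as np

rng = np.random.default_rng(0)

def deterministic_encoder(x):
    # Placeholder for the encoder network; any deterministic mapping works here.
    w = np.ones((x.shape[-1], 2)) * 0.05
    return x @ w

def sample_posterior(x, noise_std=0.3, n_samples=5):
    """Draw samples from the implicit posterior q(z|x) induced by input noise.

    Because the noise is injected at the input and then passed through the
    (in general nonlinear) encoder, q(z|x) is not restricted to a Gaussian.
    """
    samples = []
    for _ in range(n_samples):
        eps = noise_std * rng.standard_normal(x.shape)
        samples.append(deterministic_encoder(x + eps))
    return np.stack(samples)

x = rng.standard_normal((1, 784))
print(sample_posterior(x).squeeze())   # different z for the same x
```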

Conditioning of PixelCNN

There are three methods to implement how the PixelCNN conditions on the latent vector.

Location-Invariant Bias. This is the method that was proposed in the conditional PixelCNN model [56]. Suppose the size of the convolutional layer of the decoder is (batch, width, height, channels). The PixelCNN can then use a linear mapping to convert the conditioning tensor of size (batch, condition_size) to a tensor of size (batch, channels) that is broadcasted and added to the feature maps of all the layers of the PixelCNN decoder as an adaptive bias. In this method, the hidden code is encouraged to learn the global information that is location-invariant (the "what" information and not the "where" information), such as the class label. We use this method in all the clustering and semi-supervised learning experiments.

Location-Dependent Bias. Suppose the size of the convolutional layer of the PixelCNN decoder is (batch, width, height, channels). The PixelCNN can then use a one-layer neural network to convert the conditioning tensor of size (batch, condition_size) to a spatial tensor of size (batch, width, height, k), followed by a 1 × 1 convolutional layer that constructs a tensor of size (batch, width, height, channels), which is added only to the feature maps of the first layer of the decoder as an adaptive bias (similar to the VPN model [111]). When k = 1, we can simply broadcast the tensor of size (batch, width, height, k=1) to get a tensor of size (batch, width, height, channels) instead of using the 1 × 1 convolution. In this method, the latent vector carries spatial, location-dependent information within the feature map. This is the method that we used in the experiments of Figure 5.2a.

Input Channel. Another method for conditioning is proposed in the PixelVAE [101] and the variational lossy autoencoder (VLAE) [99]. In this method, a tensor of size (batch, width, height, k) is first constructed from the conditioning tensor, as in the location-dependent bias. This tensor is then concatenated to the input of the PixelCNN. The performance and computational complexity of this method are very similar to those of the location-dependent bias method.
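The following NumPy sketch illustrates the tensor shapes involved in the first two conditioning schemes; the weights are random placeholders and the broadcasting stands in for the additions performed inside the network.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, width, height, channels, condition_size, k = 16, 28, 28, 32, 30, 1

feature_maps = rng.standard_normal((batch, width, height, channels))
h = rng.standard_normal((batch, condition_size))        # latent code from the encoder

# --- Location-invariant bias (conditional PixelCNN style) -------------------
W_inv = rng.standard_normal((condition_size, channels)) * 0.01
bias = h @ W_inv                                         # (batch, channels)
conditioned = feature_maps + bias[:, None, None, :]      # broadcast over width and height

# --- Location-dependent bias (VPN-style, k = 1 case) ------------------------
W_dep = rng.standard_normal((condition_size, width * height * k)) * 0.01
spatial = np.maximum(h @ W_dep, 0.0)                     # one-layer net with ReLU
spatial = spatial.reshape(batch, width, height, k)
# With k = 1 we broadcast across channels instead of applying a 1x1 convolution;
# this bias is added only to the first layer of the decoder.
conditioned_first_layer = feature_maps + spatial

print(conditioned.shape, conditioned_first_layer.shape)
```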

5.6.2 Experiment Details

We used TensorFlow [112] in all of our experiments. As suggested in [11], in order to improve the stability of GAN training, the generator of the GAN in all our experiments is trained to maximize log D(G(z)) rather than to minimize log(1 − D(G(z))).
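The difference between the two generator objectives can be seen with a few lines of NumPy (an illustration only; in practice the losses are computed and differentiated by the framework). Early in training the discriminator assigns small values to generated samples, where log(1 − D(G(z))) saturates while −log D(G(z)) still provides a strong training signal.

```python
import numpy as np

def generator_loss_saturating(d_fake):
    """Original minimax generator loss: minimize log(1 - D(G(z)))."""
    return np.mean(np.log(1.0 - d_fake))

def generator_loss_nonsaturating(d_fake):
    """Heuristic loss used here: maximize log D(G(z)), i.e. minimize -log D(G(z))."""
    return -np.mean(np.log(d_fake))

# d_fake are discriminator outputs on generated samples, in (0, 1).
d_fake = np.array([0.01, 0.02, 0.05])        # early in training D easily rejects fakes
print(generator_loss_saturating(d_fake))     # near zero; gradients w.r.t. D(G(z)) are tiny
print(generator_loss_nonsaturating(d_fake))  # large; much stronger learning signal
```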

MNIST Dataset

The MNIST dataset has 50K training points, 10K validation points and 10K test points. We perform experiments on both the binary MNIST and the real-valued MNIST. In the real-valued MNIST experiments, we subtract 127.5 from the data points and then divide them by 127.5, and use the discretized logistic mixture likelihood [113] as the cost function for the PixelCNN. In the case of binary MNIST, the data points are binarized by setting pixel values larger than 0.5 to 1, and values smaller than 0.5 to 0.

PixelGAN Autoencoders with Gaussian Prior on MNIST. Here we describe the model architecture used for training the PixelGAN autoencoder with a Gaussian prior on the binary MNIST dataset in Figure 5.2a. The PixelCNN decoder uses both the vertical and horizontal stacks similar to [56]. The cost function of the PixelCNN is the cross-entropy cost function. The PixelCNN uses the location-dependent bias as described in Appendix 5.6.1. Specifically, a tensor of size (batch, width, height, 1) is constructed from the conditioning vector by using a one-layer neural network with 1000 hidden units, ReLU activation and linear output. This tensor is then broadcasted and added only to the feature maps of the first layer of the PixelCNN decoder. The PixelCNN is designed to have a local receptive field by having 3 residual blocks (filter size of 3x5, 32 feature maps, ReLU non-linearity as in [56]). The adversarial discriminator has two layers of 2000 hidden units with ReLU activation function. The encoder architecture has two fully-connected layers of size 2000 with ReLU non-linearity. The last layer of the encoder q(z|x) has a linear activation function. On the latent representation of size 2, we impose a Gaussian distribution with standard deviation of 5. We used the gradient descent with momentum algorithm for optimizing all the cost functions of the network. For the PixelCNN reconstruction cost, we used a learning rate of 0.001 and a momentum value of 0.9. After 25 epochs we reduce the learning rate to 0.0001. For both the generator and the discriminator costs, the learning rates and the momentum values were set to 0.1.

Unsupervised Clustering of MNIST. Here we describe the model architecture used for clustering the binary MNIST dataset in Figure 5.6 and Section 5.3.1. The PixelCNN decoder uses both the vertical and horizontal stacks similar to [56]. The cost function of the PixelCNN is the cross-entropy cost function. The PixelCNN uses the location-invariant bias as described in Appendix 5.6.1 and has 15 residual blocks (filter size of 3x5, 32 feature maps, ReLU non-linearity as in [56]). The adversarial discriminator has two layers of 3000 hidden units with ReLU activation function. The encoder architecture has a convolutional layer (filter size of 7, 32 feature maps, ReLU activation) and a max-pooling layer (pooling size 2), followed by another convolutional layer (filter size of 7, 32 feature maps, ReLU activation) and a max-pooling layer (pooling size 2), with no fully-connected layer. The last layer of the encoder q(z|x) has the softmax activation function. We found it important to use batch-normalization [94] for all the layers of the encoder including the softmax layer. The number of clusters is chosen to be 30. The clusters are represented by a discrete one-hot variable of size 30. On the continuous probability output of the softmax, we impose a categorical distribution with uniform probabilities. We use the Adam [114] optimizer with a learning rate of 0.001 for optimizing the PixelCNN reconstruction cost function, but we found it important to use the gradient descent with momentum algorithm for optimizing the generator and the discriminator costs of the adversarial network. For both the generator and the discriminator costs, the momentum values were set to 0.1 and the learning rates were set to 0.01. We use an input dropout noise with a keep probability of 0.8 at the input layer, and only at training time. The model architecture used for Figure 5.5 is the same as this architecture, except that the number of clusters is chosen to be 3.

Semi-Supervised MNIST. We performed semi-supervised learning experiments on both the binary and the real-valued MNIST dataset. We found that the semi-supervised error rate of the real-valued MNIST is roughly the same as that of the binary MNIST (about 1.10% with 100 labels), but it takes longer to train due to the logistic mixture likelihood cost function [113]. So in Table 5.1, we only report the performance on the binary MNIST, but in Figure 5.7b we show samples of the real-valued MNIST with 100 labels.

Semi-Supervised Binary MNIST. Here we describe the model architecture used for the semi-supervised learning experiments on the binary MNIST in Section 5.3.2 and Table 5.1. The PixelCNN decoder uses both the vertical and horizontal stacks similar to [56] and uses the cross-entropy cost function. The PixelCNN uses the location-invariant bias as described in Appendix 5.6.1. The PixelCNN has 6 residual blocks (filter size of 3x5, 32 feature maps, ReLU non-linearity as in [56]). The adversarial discriminator has two layers of 1000 hidden units with ReLU activation function. The encoder architecture has three convolutional layers (filter size of 5, 32 feature maps, ReLU activation) and a max-pooling layer (pooling size 2), followed by another three convolutional layers (filter size of 5, 32 feature maps, ReLU activation) and a max-pooling layer (pooling size 2), with no fully-connected layer. The last layer of the encoder q(z|x) has the softmax activation function. All the convolutional layers of the encoder except the softmax layer use batch-normalization [94]. On the latent representation, we impose a categorical distribution with uniform probabilities. The semi-supervised cost is the cross-entropy cost function at the output of q(z|x). We use the Adam [114] optimizer with a learning rate of 0.001 for optimizing the PixelCNN cost and the cross-entropy cost, but we found it important to use the gradient descent with momentum algorithm for optimizing the generator and the discriminator costs of the adversarial network. For both the generator and the discriminator costs, the momentum values were set to 0.1 and the learning rates were set to 0.1. We add a Gaussian noise with standard deviation of 0.3 to the input layer as described in Appendix 5.6.1. The labeled examples were chosen at random but evenly distributed across the classes.

Semi-Supervised Real-valued MNIST. Here we describe the model architecture used for the semi-supervised learning experiments on the real-valued MNIST in Figure 5.7b. The PixelCNN decoder uses both the vertical and horizontal stacks similar to [56] and uses a discretized logistic mixture likelihood cost function with 10 logistic distributions as proposed in [113]. The PixelCNN uses the location-invariant bias as described in Appendix 5.6.1. The PixelCNN has 20 residual blocks (filter size of 2x3, 64 feature maps, gated sigmoid-tanh non-linearity as in [56]). The adversarial discriminator has two layers of 1000 hidden units with ReLU activation function. The encoder architecture has three convolutional layers (filter size of 5, 32 feature maps, ReLU activation) and a max-pooling layer (pooling size 2), followed by another three convolutional layers (filter size of 5, 32 feature maps, ReLU activation) and a max-pooling layer (pooling size 2), with no fully-connected layer. The last layer of the encoder q(z|x) has the softmax activation function. All the convolutional layers of the encoder except the softmax layer use batch-normalization [94]. On the latent representation, we impose a categorical distribution with uniform probabilities. The semi-supervised cost is the cross-entropy cost function at the output of q(z|x). We use the Adam [114] optimizer with a learning rate of 0.001 for optimizing the PixelCNN cost and the cross-entropy cost, but we found it important to use the gradient descent with momentum algorithm for optimizing the generator and the discriminator costs of the adversarial network. For both the generator and the discriminator costs, the momentum values were set to 0.1 and the learning rates were set to 0.1. After 150 epochs, we divide all the learning rates by 10. We add a Gaussian noise with standard deviation of 0.3 to the input layer as described in Appendix 5.6.1. The labeled examples were chosen at random but evenly distributed across the classes.
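As a concrete reference for the preprocessing described above, here is a minimal NumPy sketch of the two input pipelines (the function names and the toy random input are ours): real-valued MNIST is scaled to [-1, 1] via (x − 127.5)/127.5, and binary MNIST is thresholded at 0.5.

```python
import numpy as np

def preprocess_real_valued(x_uint8):
    """Scale pixel values to [-1, 1] as described above: (x - 127.5) / 127.5."""
    return (x_uint8.astype(np.float32) - 127.5) / 127.5

def binarize(x_unit_interval, threshold=0.5):
    """Binary MNIST: pixels above the threshold become 1, the rest become 0."""
    return (x_unit_interval > threshold).astype(np.float32)

x = np.random.default_rng(0).integers(0, 256, size=(2, 28, 28), dtype=np.uint8)
print(preprocess_real_valued(x).min(), preprocess_real_valued(x).max())
print(np.unique(binarize(x / 255.0)))
```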

SVHN Dataset

The SVHN dataset has about 530K training points and 26K test points. We use 10K points for the validation set. Similar to [88], we downsample the images from 32 × 32 × 3 to 16 × 16 × 3, and then subtract 127.5 from the data points and divide them by 127.5.

Semi-Supervised SVHN. Here we describe the model architecture used for the semi-supervised learning experiments on the SVHN dataset in Section 5.3.2. The PixelCNN decoder uses both the vertical and horizontal stacks similar to [56]. The cost function of the PixelCNN is a discretized logistic mixture likelihood cost function with 10 logistic distributions as proposed in [113]. The PixelCNN uses the location-invariant bias as described in Appendix 5.6.1 and has 20 residual blocks (filter size of 3x5, 32 feature maps, gated sigmoid-tanh non-linearity as in [56]). The adversarial discriminator has two layers of 1000 hidden units with ReLU activation function. The encoder architecture has two convolutional layers (filter size of 5, 32 feature maps, ReLU activation) and a max-pooling layer (pooling size 2), followed by another two convolutional layers (filter size of 5, 32 feature maps, ReLU activation) and a max-pooling layer (pooling size 2), with no fully-connected layer. The last layer of the encoder q(z|x) has the softmax activation function. All the convolutional layers of the encoder except the softmax layer use batch-normalization [94]. On the latent representation, we impose a categorical distribution with uniform probabilities. The semi-supervised cost is the cross-entropy cost function at the output of q(z|x). We use the Adam [114] optimizer for optimizing all the cost functions. For the PixelCNN cost and the cross-entropy cost we use a learning rate of 0.001, and for the generator and the discriminator costs of the adversarial network we use a learning rate of 0.0001. We add a Gaussian noise with standard deviation of 0.2 to the input layer as described in Appendix 5.6.1.
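For convenience, the semi-supervised SVHN hyperparameters listed above can be collected into a single configuration object. The sketch below is only a summary of the text; the dictionary keys are our own naming and are not taken from any released code.

```python
# Hypothetical configuration summarizing the semi-supervised SVHN setup described above.
svhn_semi_supervised_config = {
    "input_size": (16, 16, 3),                 # images downsampled from 32x32x3
    "pixelcnn": {
        "residual_blocks": 20,
        "filter_size": (3, 5),
        "feature_maps": 32,
        "nonlinearity": "gated sigmoid-tanh",
        "loss": "discretized logistic mixture (10 components)",
        "conditioning": "location-invariant bias",
    },
    "encoder": {
        # Two repetitions of: 2 x (conv 5x5, 32 maps, ReLU) followed by 2x2 max-pooling.
        "blocks": 2,
        "convs_per_block": 2,
        "filter_size": 5,
        "feature_maps": 32,
        "output": "softmax with batch normalization",
    },
    "discriminator_hidden": [1000, 1000],
    "optimizer": "Adam",
    "learning_rates": {"pixelcnn_and_cross_entropy": 1e-3, "adversarial": 1e-4},
    "input_noise_std": 0.2,
    "latent_prior": "uniform categorical",
}
```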

NORB Dataset

The NORB dataset has about 24K training points and 24K test points. We use 4K points for the validation set. This dataset has 5 object categories: animals, human figures, airplanes, trucks and cars. We downsample the images to the size of 32 × 32 × 1, subtract 127.5 from the data points and then divide them by 127.5.

Semi-Supervised NORB. The PixelCNN decoder uses both the vertical and horizontal stacks similar to [56]. The cost function of the PixelCNN is a discretized logistic mixture likelihood cost function with 10 logistic distributions as proposed in [113]. The PixelCNN uses the location-invariant bias as described in Appendix 5.6.1 and has 15 residual blocks (filter size of 3x5, 32 feature maps, gated sigmoid-tanh non-linearity as in [56]). The adversarial discriminator has two layers of 1000 hidden units with ReLU activation function. The encoder architecture has a convolutional layer (filter size of 7, 32 feature maps, ReLU activation) and a max-pooling layer (pooling size 2), followed by another convolutional layer (filter size of 7, 32 feature maps, ReLU activation) and a max-pooling layer (pooling size 2), followed by another convolutional layer (filter size of 7, 32 feature maps, ReLU activation) and a max-pooling layer (pooling size 2), with no fully-connected layer. The last layer of the encoder q(z|x) has the softmax activation function. All the convolutional layers of the encoder except the softmax layer use batch-normalization [94]. On the latent representation, we impose a categorical distribution with uniform probabilities. The semi-supervised cost is the cross-entropy cost function at the output of q(z|x). We use the Adam [114] optimizer for optimizing all the cost functions. For the PixelCNN cost and the cross-entropy cost we use a learning rate of 0.001, and for the generator and the discriminator costs of the adversarial network we use a learning rate of 0.0001. We add a Gaussian noise with standard deviation of 0.3 to the input layer as described in Appendix 5.6.1. The labeled examples were chosen at random but evenly distributed across the classes. In the case of NORB with 1000 labels, the test error after 10 epochs is 12.97%, after 100 epochs is 11.63%, and after 500 epochs is 8.17%.

Chapter 6

Conclusions

6.1 Summary of Contributions

In this thesis, we studied the problem of learning unsupervised representations using autoencoders, and proposed regularization techniques that enable autoencoders to learn useful representations of data in unsupervised and semi-supervised settings. In Chapter 2 and Chapter 3, we proposed regularization techniques that exploit sparsity as a generic prior on the representations. In particular, in Chapter 2, we proposed the k-sparse autoencoder as a very fast sparse coding method that can achieve exact sparsity in the hidden representation. We investigated the effectiveness of sparsity by itself and showed that by solely enforcing sparsity in the hidden units, and without using any other nonlinearity or regularization, we can learn unsupervised representations that achieve competitive classification results. In Chapter 3, we proposed another sparse coding algorithm called the winner-take-all autoencoder, which is a sparse autoencoder that uses winner-take-all spatial and lifetime sparsity operators to learn fully-connected and convolutional sparse representations. We observed that convolutional winner-take-all autoencoders learn shift-invariant and diverse dictionary atoms, as opposed to the position-specific Gabor-like atoms that are typically learnt by conventional sparse coding methods. In Chapter 4 and Chapter 5, we proposed to regularize autoencoders using a generative adversarial network that matches the distribution of the latent variable of the autoencoder with a pre-defined prior. In particular, in Chapter 4, we proposed the adversarial autoencoder, which is a generative autoencoder that uses the GAN framework as a variational inference method for both discrete and continuous latent variables. We showed that adversarial autoencoders can achieve competitive results in generative modeling and semi-supervised classification tasks. In Chapter 5, we proposed the PixelGAN autoencoder, which is a generative autoencoder in which the generative path is a conditional PixelCNN that conditions on a latent vector, and the recognition path, similar to the adversarial autoencoder, uses a GAN inference network to impose an arbitrary prior distribution on the latent code. We showed that imposing different distributions as the prior enables us to learn a latent representation that captures the type of statistics that we care about, while the remaining structure of the image is captured by the PixelCNN decoder. Specifically, by imposing a Gaussian prior, we were able to disentangle the low-frequency and high-frequency statistics of the images, and by imposing a categorical prior we were able to disentangle the style and content of images and learn representations that are specifically useful for clustering and semi-supervised learning tasks.

6.2 Future Directions

Despite the recent progress in deep learning, one of the major challenges of machine learning, and of AI in general, in the next decade will be unsupervised learning. Here, we propose several future directions for improving unsupervised learning algorithms.

Implicit Distributions for Variational Inference. One of the most successful models for generative image modeling is the generative adversarial network (GAN) [11], which employs a two-player min-max game to model the data distribution. GANs can be considered within the wider framework of implicit generative models [42, 43, 44]. Implicit distributions can be sampled through their generative path, but their likelihood function is not tractable. Recently, several papers have proposed another application of GAN-style algorithms for approximate inference [42, 43, 44, 45, 46, 59, 47, 48, 49]. These algorithms use implicit distributions to learn posterior approximations that are more expressive than the distributions with tractable densities that are often used in variational inference. For example, in Chapter 4, we proposed the adversarial autoencoder, which uses a universal approximator posterior as the implicit posterior distribution and uses adversarial training to match the aggregated posterior of the latent code to the prior distribution. In Chapter 5, we proposed the PixelGAN autoencoder, a generative autoencoder in which the generative path is a convolutional autoregressive neural network on pixels (PixelCNN) that is conditioned on a latent code, and the recognition path uses an amortized GAN inference technique to learn implicit posterior distributions [46]. There have been some attempts to establish a mathematical connection between maximum-likelihood learning and the cost function of variational inference with implicit distributions, such as [47, 43]. Exploring this mathematical connection further and developing new variational inference techniques with implicit distributions are potentially important directions for improving unsupervised learning.
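As a small illustration of what "implicit" means here, the NumPy sketch below (with placeholder weights) draws samples by pushing Gaussian noise through a two-layer network: sampling is trivial, but the density of the resulting distribution has no closed form, which is why adversarial rather than likelihood-based training is used to match it to another distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def implicit_sampler(n, noise_dim=8, out_dim=2):
    """Sample from an implicit distribution by pushing noise through a network.

    We can draw samples (the generative path), but the density of the outputs is
    not available in closed form; only sample-based (e.g. adversarial) objectives apply.
    """
    w1 = rng.standard_normal((noise_dim, 64)) * 0.5     # placeholder weights
    w2 = rng.standard_normal((64, out_dim)) * 0.5
    eps = rng.standard_normal((n, noise_dim))
    return np.tanh(eps @ w1) @ w2

samples = implicit_sampler(1000)
print(samples.mean(axis=0), samples.std(axis=0))   # sample statistics are easy; log p(z) is not
```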

Improving Autoregressive Generative Models. Another successful framework for learning density models is autoregressive neural networks such as NADE [53], MADE [54], PixelRNN [55] and PixelCNN [56]. While these models achieve very high log-likelihood, they have two main drawbacks. First, autoregressive models such as the PixelCNN learn the image density directly at the pixel level, without learning a hierarchical latent representation. This is undesirable, as one of the main goals of generative modeling is to discover useful representations that can be used in downstream tasks such as semi-supervised learning. The second problem is that these models tend to use all their capacity to capture the low-level pixel statistics, and hence the generated images often have little recognizable global structure. In order to address both of these problems, we proposed the PixelGAN autoencoder in Chapter 5, which combines the benefits of latent variable models with autoregressive architectures. We showed that the latent variable in PixelGAN autoencoders captures the global structure of the image while the autoregressive decoder captures the local statistics of the image. We further showed that by using the latent representation of PixelGAN autoencoders, we can achieve state-of-the-art semi-supervised learning performance on several datasets. There have been other attempts at scaling PixelCNNs to large images by training PixelCNNs that condition on a low-resolution image and generate a higher-resolution image, such as [115, 116]; however, these networks are not trained end-to-end. Scaling autoregressive generative models to large-scale images and finding efficient ways to incorporate latent variables in these architectures are important future research directions.
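To make the first point concrete, the sketch below evaluates the autoregressive factorization log p(x) = Σ_i log p(x_i | x_<i) for a binary image in raster-scan order. The predict_pixel function is a toy stand-in for a PixelCNN-style conditional, not a real model; the point is only that the density is defined directly over pixels, with no latent representation involved.

```python
import numpy as np

rng = np.random.default_rng(0)

def autoregressive_log_likelihood(x, predict_pixel):
    """log p(x) = sum_i log p(x_i | x_{<i}) for a binary image x, in raster order.

    `predict_pixel(prefix)` stands in for a PixelCNN-style model: it returns the
    Bernoulli probability of the next pixel given all previously seen pixels.
    """
    flat = x.reshape(-1)
    log_p = 0.0
    for i in range(flat.size):
        p = predict_pixel(flat[:i])                      # p(x_i = 1 | x_{<i})
        log_p += np.log(p if flat[i] == 1 else 1.0 - p)
    return log_p

# A trivial stand-in model: predicts the running mean of the prefix (clipped away from 0/1).
toy_model = lambda prefix: np.clip(prefix.mean() if prefix.size else 0.5, 0.05, 0.95)
x = (rng.random((8, 8)) > 0.7).astype(np.int64)
print(autoregressive_log_likelihood(x, toy_model))
```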

Improving Generative Adversarial Networks. Generative adversarial networks (GANs) [11] are one of the most promising methods for generative image modeling. Compared to variational autoencoders, they can generate sharper images, and compared to autoregressive models, they are much faster to sample from. However, they also have some drawbacks. First, unlike variational autoencoders, they only learn a top-down generative model without learning an inference network, so it is not straightforward to use them for downstream tasks such as semi-supervised learning. Second, similar to autoregressive models, GANs capture local statistics very well, but images generated by unconditional GANs often lack global structure. Third, training a GAN requires finding a Nash equilibrium of a two-player min-max game. Gradient descent is not a good equilibrium-finding algorithm, and thus GAN training is unstable compared to VAE or PixelCNN training. In this thesis, we proposed adversarial autoencoders (Chapter 4) and PixelGAN autoencoders (Chapter 5), which learn efficient inference networks for GANs and obtain highly competitive semi-supervised classification results. Other attempts at learning inference networks for GANs include the ALI [48], BiGAN [49], Improved-GAN [103] and InfoGAN [97] models. Scaling GANs to large-scale images, making their training more stable, and designing efficient inference techniques for them are interesting future research directions.

Semi-supervised Learning. Semi-supervised learning is a long-standing and important problem in machine learning. In this thesis, we developed sparse autoencoders (Chapter 2 and Chapter 3) and generative autoencoders (Chapter 4 and Chapter 5) to learn features that are useful for semi-supervised classification tasks. However, semi-supervised learning on large-scale datasets such as ImageNet still remains an open problem. We believe the path to scaling semi-supervised learning to large-scale images is to develop generative models that can capture the statistics of large-scale images and allow efficient inference of the latent variables, especially the discrete variables. A good generative model should be able to disentangle the label information from the other underlying factors of variation in an unsupervised fashion. This factorization of information enables the posterior over the discrete latent variable to be used for semi-supervised learning. Scaling semi-supervised learning to large-scale images using generative models with discrete variables is an important avenue for future research.

Bibliography

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in NIPS, vol. 1, p. 4, 2012.

[2] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.

[3] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, "Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning," Nature Biotechnology, vol. 33, no. 8, pp. 831–838, 2015.

[4] H. Y. Xiong, B. Alipanahi, L. J. Lee, H. Bretschneider, D. Merico, R. K. Yuen, Y. Hua, S. Gueroussov, H. S. Najafabadi, T. R. Hughes, et al., “The human splicing code reveals new insights into the genetic determinants of disease,” Science, vol. 347, no. 6218, p. 1254806, 2015.

[5] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.

[6] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” ICML, 2013.

[7] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, “Pedestrian detection with unsupervised multi-stage feature learning,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 3626–3633, IEEE, 2013.

[8] R. Salakhutdinov and G. E. Hinton, “Deep boltzmann machines,” in International Conference on Artificial Intelligence and Statistics, pp. 448–455, 2009.


[9] V. Nair and G. E. Hinton, "3d object recognition with deep belief nets," in NIPS, pp. 1339–1347, 2009.

[10] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” International Conference on Learning Representations (ICLR), 2014.

[11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

[12] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler, “Skip-thought vectors,” in Advances in neural information processing systems, pp. 3294–3302, 2015.

[13] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[14] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” science, vol. 313, no. 5786, pp. 504–507, 2006.

[15] A. Ng, “Sparse autoencoder,” CS294A Lecture notes, vol. 72, 2011.

[16] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” The Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.

[17] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proceedings of the 25th international conference on Machine learning, pp. 1096–1103, ACM, 2008.

[18] G. Alain and Y. Bengio, "What regularized auto-encoders learn from the data-generating distribution," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 3563–3593, 2014.

[19] Y. Bengio, L. Yao, G. Alain, and P. Vincent, “Generalized denoising auto-encoders as generative models,” in Advances in Neural Information Processing Systems, pp. 899–907, 2013.

[20] S. Rifai, G. Mesnil, P. Vincent, X. Muller, Y. Bengio, Y. Dauphin, and X. Glorot, "Higher order contractive auto-encoder," in Machine Learning and Knowledge Discovery in Databases, pp. 645–660, Springer, 2011.

[21] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, “Semi-supervised learning with deep generative models,” in Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.

[22] L. Theis, W. Shi, A. Cunningham, and F. Huszár, “Lossy image compression with compressive autoencoders,” arXiv preprint arXiv:1703.00395, 2017.

[23] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell, “Full resolution image compression with recurrent neural networks,” arXiv preprint arXiv:1608.05148, 2016.

[24] K. Gregor, F. Besse, D. J. Rezende, I. Danihelka, and D. Wierstra, “Towards conceptual compression,” in Advances In Neural Information Processing Systems, pp. 3549–3557, 2016.

[25] R. Salakhutdinov and G. Hinton, “Semantic hashing,” International Journal of Approximate Reasoning, vol. 50, no. 7, pp. 969–978, 2009.

[26] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Advances in neural information processing systems, pp. 1753–1760, 2009.

[27] A. Krizhevsky and G. E. Hinton, "Using very deep autoencoders for content-based image retrieval," in ESANN, 2011.

[28] B. A. Olshausen and D. J. Field, “Sparse coding with an overcomplete basis set: A strategy employed by v1?,” Vision research, vol. 37, no. 23, pp. 3311–3325, 1997.

[29] V. Nair and G. E. Hinton, “3d object recognition with deep belief nets,” in Advances in Neural Information Processing Systems, pp. 1339–1347, 2009.

[30] J. A. Tropp and A. C. Gilbert, “Signal recovery from random measurements via orthogonal matching pursuit,” Information Theory, IEEE Transactions on, vol. 53, no. 12, pp. 4655–4666, 2007.

[31] K. Engan, S. O. Aase, and J. Hakon Husoy, “Method of optimal directions for frame design,” in Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on, vol. 5, pp. 2443–2446, IEEE, 1999.

[32] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006.

[33] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 399–406, 2010.

[34] K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “Fast inference in sparse coding algorithms with applications to object recognition,” arXiv preprint arXiv:1010.3467, 2010.

[35] G. E. Hinton, S. Osindero, and Y. W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, pp. 1527–1554, 2006.

[36] R. Salakhutdinov and G. E. Hinton, “Deep boltzmann machines,” in International Conference on Artificial Intelligence and Statistics, pp. 448–455, 2009.

[37] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropagation and approximate inference in deep generative models,” International Conference on Machine Learning, 2014.

[38] Y. Burda, R. Grosse, and R. Salakhutdinov, “Importance weighted autoencoders,” arXiv preprint arXiv:1509.00519, 2015.

[39] Z. Hu, Z. Yang, R. Salakhutdinov, and E. P. Xing, “On unifying deep generative models,” arXiv preprint arXiv:1706.00550, 2017.

[40] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal, "The 'wake-sleep' algorithm for unsupervised neural networks," Science, vol. 268, no. 5214, p. 1158, 1995.

[41] B. J. Frey, G. E. Hinton, and P. Dayan, “Does the wake-sleep algorithm produce good density estimators?,” in Advances in neural information processing systems, pp. 661–667, 1996.

[42] S. Mohamed and B. Lakshminarayanan, “Learning in implicit generative models,” arXiv preprint arXiv:1610.03483, 2016.

[43] F. Huszár, “Variational inference using implicit distributions,” arXiv preprint arXiv:1702.08235, 2017.

[44] D. Tran, R. Ranganath, and D. M. Blei, “Deep and hierarchical implicit models,” arXiv preprint arXiv:1702.08896, 2017.

[45] R. Ranganath, D. Tran, J. Altosaar, and D. Blei, "Operator variational inference," in Advances in Neural Information Processing Systems, pp. 496–504, 2016.

[46] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, "Adversarial autoencoders," International Conference on Learning Representations (ICLR) Workshop, 2016.

[47] L. Mescheder, S. Nowozin, and A. Geiger, "Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks," arXiv preprint arXiv:1701.04722, 2017.

[48] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville, “Adversarially learned inference,” arXiv preprint arXiv:1606.00704, 2016.

[49] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” arXiv preprint arXiv:1605.09782, 2016.

[50] Y. Li, K. Swersky, and R. Zemel, "Generative moment matching networks," International Conference on Machine Learning (ICML), 2015.

[51] D. J. Rezende and S. Mohamed, “Variational inference with normalizing flows,” arXiv preprint arXiv:1505.05770, 2015.

[52] D. P. Kingma, T. Salimans, and M. Welling, “Improving variational inference with inverse autoregressive flow,” arXiv preprint arXiv:1606.04934, 2016.

[53] H. Larochelle and I. Murray, "The neural autoregressive distribution estimator," in AISTATS, vol. 1, p. 2, 2011.

[54] M. Germain, K. Gregor, I. Murray, and H. Larochelle, "MADE: Masked autoencoder for distribution estimation," in ICML, pp. 881–889, 2015.

[55] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” arXiv preprint arXiv:1601.06759, 2016.

[56] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al., “Conditional image generation with pixelcnn decoders,” in Advances in Neural Information Processing Systems, pp. 4790–4798, 2016.

[57] A. Makhzani and B. Frey, “K-sparse autoencoders,” International Conference on Learning Representations (ICLR), 2014.

[58] A. Makhzani and B. J. Frey, "Winner-take-all autoencoders," in Advances in Neural Information Processing Systems, pp. 2791–2799, 2015.

[59] A. Makhzani and B. Frey, “Pixelgan autoencoders,” Advances in Neural Information Processing Systems, 2017.

[60] A. Coates, A. Y. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” in International Conference on Artificial Intelligence and Statistics, 2011.

[61] J. C. Van Gemert, J.-M. Geusebroek, C. J. Veenman, and A. W. Smeulders, “Kernel codebooks for scene categorization,” in Computer Vision–ECCV 2008, pp. 696–709, Springer, 2008.

[62] K. Swersky, I. Sutskever, D. Tarlow, R. S. Zemel, R. Salakhutdinov, and R. P. Adams, “Cardinality restricted boltzmann machines,” in Advances in Neural Information Processing Systems, pp. 3302–3310, 2012.

[63] T. Blumensath and M. E. Davies, “Iterative hard thresholding for compressed sensing,” Applied and Computational Harmonic Analysis, vol. 27, no. 3, pp. 265–274, 2009.

[64] A. Maleki, "Coherence analysis of iterative thresholding algorithms," in Communication, Control, and Computing, 2009. Allerton 2009. 47th Annual Allerton Conference on, pp. 236–243, IEEE, 2009.

[65] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization," Proceedings of the National Academy of Sciences, vol. 100, no. 5, pp. 2197–2202, 2003.

[66] Y. LeCun, F. J. Huang, and L. Bottou, “Learning methods for generic object recognition with invariance to pose and lighting,” in Computer Vision and Pattern Recognition, CVPR, vol. 2, pp. II–97, IEEE, 2004.

[67] A. Coates, A. Y. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” in International Conference on Artificial Intelligence and Statistics, pp. 215–223, 2011.

[68] T. Tieleman, “Gnumpy: an easy way to use gpu boards in python,” 2010.

[69] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, "Why does unsupervised pre-training help deep learning?," The Journal of Machine Learning Research, vol. 11, pp. 625–660, 2010.

[70] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” Advances in neural information processing systems, vol. 19, p. 153, 2007.

[71] K. Kavukcuoglu, P. Sermanet, Y.-L. Boureau, K. Gregor, M. Mathieu, and Y. LeCun, “Learning convolutional feature hierarchies for visual recognition.,” in NIPS, vol. 1, p. 5, 2010.

[72] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in Proceedings of the 26th Annual International Conference on Machine Learning, pp. 609–616, ACM, 2009.

[73] A. Krizhevsky, “Convolutional deep belief networks on cifar-10,” Unpublished, 2010.

[74] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, “Deconvolutional networks,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 2528–2535, IEEE, 2010.

[75] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPS workshop on deep learning and unsupervised feature learning, vol. 2011, p. 5, Granada, Spain, 2011.

[76] M. D. Zeiler and R. Fergus, “Differentiable pooling for hierarchical feature learning,” arXiv preprint arXiv:1207.0151, 2012.

[77] J. Bruna and S. Mallat, “Invariant scattering convolution networks,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 8, pp. 1872–1886, 2013.

[78] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid, “Convolutional kernel networks,” in Advances in Neural Information Processing Systems, pp. 2627–2635, 2014.

[79] M. Ranzato, F. J. Huang, Y.-L. Boureau, and Y. Lecun, “Unsupervised learning of invariant feature hierarchies with applications to object recognition,” in Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pp. 1–8, IEEE, 2007.

[80] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, "Semi-supervised learning with deep generative models," in Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.

[81] A. Coates and A. Y. Ng, "Selecting receptive fields in deep networks," in NIPS, 2011.

[82] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox, “Discriminative unsupervised feature learning with convolutional neural networks,” in Advances in Neural Information Processing Systems, pp. 766–774, 2014.

[83] T.-H. Lin and H. Kung, "Stable and efficient representation learning with nonnegativity constraints," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1323–1331, 2014.

[84] Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai, "Better mixing via deep representations," International Conference on Machine Learning (ICML), 2013.

[85] Y. Bengio, E. Thibodeau-Laufer, G. Alain, and J. Yosinski, “Deep generative stochastic networks trainable by backprop,” International Conference on Machine Learning (ICML), 2014.

[86] L. Theis, A. v. d. Oord, and M. Bethge, “A note on the evaluation of generative models,” arXiv preprint arXiv:1511.01844, 2015.

[87] L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther, “Auxiliary deep generative models,” arXiv preprint arXiv:1602.05473, 2016.

[88] T. Miyato, S.-i. Maeda, M. Koyama, K. Nakae, and S. Ishii, "Distributional smoothing with virtual adversarial training," stat, vol. 1050, p. 25, 2015.

[89] J. T. Springenberg, “Unsupervised and semi-supervised learning with categorical generative adversarial networks,” arXiv preprint arXiv:1511.06390, 2015.

[90] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko, “Semi-supervised learning with ladder networks,” in Advances in Neural Information Processing Systems, pp. 3532–3540, 2015.

[91] L. Van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.

[92] L. Maaten, “Learning a parametric embedding by preserving local structure,” in International Conference on Artificial Intelligence and Statistics, pp. 384–391, 2009.

[93] G. Hinton, "Non-linear dimensionality reduction." https://www.cs.toronto.edu/~hinton/csc2535/notes/lec11new.pdf.

[94] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.

[95] M. D. Hoffman and M. J. Johnson, “Elbo surgery: yet another way to carve up the variational evidence lower bound,” in NIPS 2016 Workshop on Advances in Approximate Bayesian Inference, 2016.

[96] D. Barber and F. V. Agakov, "The IM algorithm: A variational approach to information maximization," in NIPS, pp. 201–208, 2003.

[97] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," in Advances in Neural Information Processing Systems, pp. 2172–2180, 2016.

[98] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio, “Generating sentences from a continuous space,” arXiv preprint arXiv:1511.06349, 2015.

[99] X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel, “Variational lossy autoencoder,” arXiv preprint arXiv:1611.02731, 2016.

[100] F. Huszár, "Is maximum likelihood useful for representation learning?" http://www.inference.vc/maximum-likelihood-for-representation-learning-2.

[101] I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville, “Pixelvae: A latent variable model for natural images,” arXiv preprint arXiv:1611.05013, 2016.

[102] S. Laine and T. Aila, “Temporal ensembling for semi-supervised learning,” arXiv preprint arXiv:1610.02242, 2016.

[103] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Advances in Neural Information Processing Systems, pp. 2226–2234, 2016.

[104] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim, "Learning to discover cross-domain relations with generative adversarial networks," arXiv preprint arXiv:1703.05192, 2017.

[105] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” arXiv preprint arXiv:1703.10593, 2017.

[106] I. Sutskever, R. Jozefowicz, K. Gregor, D. Rezende, T. Lillicrap, and O. Vinyals, “Towards principled unsupervised learning,” arXiv preprint arXiv:1511.06440, 2015.

[107] A. V. M. Barone, “Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders,” arXiv preprint arXiv:1608.02996, 2016.

[108] M. Zhang, Y. Liu, H. Luan, and M. Sun, "Adversarial training for unsupervised bilingual lexicon induction," in ACL, 2017.

[109] D. J. Im, S. Ahn, R. Memisevic, and Y. Bengio, “Denoising criterion for variational auto-encoding framework,” arXiv preprint arXiv:1511.06406, 2015.

[110] C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár, “Amortised map inference for image super-resolution,” arXiv preprint arXiv:1610.04490, 2016.

[111] N. Kalchbrenner, A. v. d. Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu, “Video pixel networks,” arXiv preprint arXiv:1610.00527, 2016.

[112] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. Software available from tensorflow.org.

[113] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, “Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications,” arXiv preprint arXiv:1701.05517, 2017.

[114] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[115] S. Reed, A. v. d. Oord, N. Kalchbrenner, S. G. Colmenarejo, Z. Wang, D. Belov, and N. de Freitas, "Parallel multiscale autoregressive density estimation," arXiv preprint arXiv:1703.03664, 2017.

[116] R. Dahl, M. Norouzi, and J. Shlens, “Pixel recursive super resolution,” arXiv preprint arXiv:1702.00783, 2017.