
Unsupervised Representation Learning with Autoencoders

by

Alireza Makhzani

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering

© Copyright 2018 by Alireza Makhzani

Abstract

Unsupervised Representation Learning with Autoencoders

Alireza Makhzani
Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
2018

Despite the recent progress in deep learning, unsupervised learning still remains a largely unsolved problem. It is widely recognized that unsupervised learning algorithms that can learn useful representations are needed for solving problems with limited label information. In this thesis, we study the problem of learning unsupervised representations using autoencoders, and propose regularization techniques that enable autoencoders to learn useful representations of data in unsupervised and semi-supervised settings. First, we exploit sparsity as a generic prior on the representations of autoencoders and propose sparse autoencoders that can learn sparse representations with very fast inference processes, making them well-suited to large problem sizes where conventional sparse coding algorithms cannot be applied. Next, we study autoencoders from a probabilistic perspective and propose generative autoencoders that use a generative adversarial network (GAN) to match the distribution of the latent code of the autoencoder with a pre-defined prior. We show that these generative autoencoders can learn posterior approximations that are more expressive than the tractable densities often used in variational inference. We demonstrate the performance of the methods developed in this thesis on real-world image datasets and show their applications in generative modeling, clustering, semi-supervised classification and dimensionality reduction.

To my beloved parents and sister:

Nasrin, Hassan, and Parastesh.

Acknowledgements

Pursuing a PhD was one of the best decisions of my life. I wish to express my sincerest thanks to my advisor, Brendan Frey. I thank Brendan for his continuous support and encouragement, for believing in the work I wanted to pursue, and for being a great inspiration in my life. Brendan is a teacher and mentor with an unmatched combination of intellect, intuition and wit. I was truly blessed to have him as my advisor.

I am very grateful to all my committee members, Rich Zemel, Raquel Urtasun, David Duvenaud and Graham Taylor, for offering me their thoughtful comments and feedback.

I would like to thank all the past and present members of the PSI lab and the Machine Learning group at U of T, especially Babak Alipanahi, Andrew Delong, Christopher Srinivasa, Jimmy Ba, Hannes Bretschneider, Alice Gao, Hui Xiong, Leo Lee, Michael Leung, and Oren Kraus, for sharing ideas and collaborating with me.

During my PhD, I did two internships at Google, both of which were a wonderful learning experience for me. I would like to thank all the members of the Google Brain team and the Google DeepMind team, especially Oriol Vinyals, Jon Shlens, Navdeep Jaitly, Ian Goodfellow, Ilya Sutskever, Timothy Lillicrap, Ali Eslami, Sam Bowman and Jon Gauthier.

I would like to especially thank Alireza Moghaddamjoo and Hamid Sheikhzadeh Nadjar for inspiring me to pursue academic research and for working with me during my undergraduate years at Amirkabir University of Technology in Iran.

I had the pleasure of going through my PhD journey with many great friends. In particular, I would like to thank Sadegh Jalali, Aynaz Vatankhah, Masoud Barekatain, Amin Heidari, Weria Havary-Nassab, David Jorjani, Parisa Zareapour, Ehsan Shojaei, Siavash Fazeli and Mohammad Norouzi. I take this opportunity to especially thank Nasrin Tehrani and Hamid Emami; I felt at home in Canada thanks to their continuous support during the past years.

My deepest gratitude and love, of course, belong to my parents, Nasrin and Hassan, and my sister, Parastesh, for their unconditional love and support throughout my life. To them I owe all that I am and all that I have ever accomplished.

Contents

1 Introduction
  1.1 Overview
  1.2 Unsupervised Representation Learning
  1.3 Contributions and Outline

2 k-Sparse Autoencoders
  2.1 Introduction
  2.2 k-Sparse Autoencoders
  2.3 Analysis of k-Sparse Autoencoders
  2.4 Experiments
  2.5 Conclusion

3 Winner-Take-All Autoencoders
  3.1 Introduction
  3.2 Fully-Connected Winner-Take-All Autoencoders
  3.3 Convolutional Winner-Take-All Autoencoders
  3.4 Experiments
  3.5 Discussion
  3.6 Conclusion
  3.7 Appendix

4 Adversarial Autoencoders
  4.1 Introduction
  4.2 Adversarial Autoencoders
  4.3 Likelihood Analysis of Adversarial Autoencoders
  4.4 Supervised Adversarial Autoencoders
  4.5 Semi-Supervised Adversarial Autoencoders
  4.6 Clustering with Adversarial Autoencoders
  4.7 Dimensionality Reduction with Adversarial Autoencoders
  4.8 Conclusion
  4.9 Appendix

5 PixelGAN Autoencoders
  5.1 Introduction
  5.2 PixelGAN Autoencoders
  5.3 Experiments
  5.4 Learning Cross-Domain Relations with PixelGAN Autoencoders
  5.5 Conclusion
  5.6 Appendix

6 Conclusions
  6.1 Summary of Contributions
  6.2 Future Directions

Bibliography

Chapter 1

Introduction

1.1 Overview

The goal of the field of Artificial Intelligence is to enable computers to understand the rich world around us, and to interact with it in an intelligent manner. More recently, "deep learning" has emerged as one of the most promising approaches to achieving this goal. Deep learning is a computational approach to learning high-level concepts in data using deep, hierarchical neural networks. Emerging deep learning methods have revolutionized many real-world applications such as computer vision [1], speech processing [2], and computational biology [3, 4] by achieving breakthrough performance on different tasks. In several of these domains, deep learning is the only technology that has been able to eclipse well-established, decade-old state-of-the-art benchmarks.

Recently, supervised neural networks have been developed and used successfully to produce representations that have enabled leaps forward in classification accuracy for several tasks [1]. These networks are often trained using convolutional architectures together with regularization techniques that have been developed in the deep learning community, such as dropout [5] or maxout [6]. Despite the recent progress in supervised representation learning, the question that has remained unanswered is whether it is possible to learn equally "powerful" representations from unlabeled data, without any supervision. It is still widely recognized that unsupervised learning algorithms that can extract useful features are needed for solving problems with limited or weak label information, such as video recognition or pedestrian detection [7]. An advantage of unsupervised learning algorithms is the ability to use them in semi-supervised scenarios where the amount of labeled data is limited.


1.2 Unsupervised Representation Learning

In parallel with research on discriminative architectures, neural networks trained in an unsupervised fashion have been developed and used to produce useful representations for several tasks such as clustering and classification [8, 9, 10, 11, 12]. These approaches include stacked autoencoders, deep Boltzmann machines [8], deep belief networks [9], variational autoencoders [10] and generative adversarial networks [11]. In this section, we briefly review these approaches by broadly categorizing them into three types of learning algorithms: autoencoders, sparse models and generative models. For a more comprehensive introduction to unsupervised representation learning, we recommend the Deep Learning book [13].

1.2.1 Autoencoders

An autoencoder is a neural network that learns to reconstruct its original input with the aim of learning useful representations. The input vector x is first mapped to a hidden representation z = f(x). The function f is parametrized by the encoder network using one or more layers of non-linearity. The hidden representation z is then mapped to the output x̂ = g(z) using the decoder network, which parametrizes the decoder function g. Similar to the encoder network, the decoder network can consist of multiple layers of non-linearity. The parameters are optimized to minimize the following cost function:

L(x, g(f(x))) = \|\hat{x} - x\|_2^2 \qquad (1.1)

The cost function of Equation 1.1 is usually optimized with an additional constraint to limit the representational capacity of the autoencoder in order to prevent the autoencoder from learning the useless identity function. The constraint could be imposed on the architecture of the autoencoder (e.g., limiting the latent code dimensionality) or could be in the form of a regularization term in the final objective (e.g., sparsity constraint).
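To make this concrete, the following is a minimal sketch of an autoencoder with the reconstruction objective of Equation 1.1, written in PyTorch. The layer sizes, the ReLU non-linearities and the MNIST-sized input are illustrative assumptions rather than a prescribed architecture.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Encoder f: x -> z and decoder g: z -> x_hat, each a small MLP.
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# For a batch x of shape (batch, input_dim), the cost of Equation 1.1 is
#   x_hat, z = model(x); loss = ((x_hat - x) ** 2).sum(dim=1).mean()
```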

Undercomplete Autoencoders

One way of learning useful representations with autoencoders is to limit the capacity of the autoencoder by constraining its code size. In this case, the autoencoder is forced to extract the salient features of the data. Indeed, it can be shown that if an autoencoder uses a linear activation along with the mean squared error criterion, the resulting architecture is equivalent to the PCA algorithm (i.e., the hidden units of the autoencoder learn the principal components of the data). However, an autoencoder that uses non-linear activations has a much larger capacity and can learn more useful representations [14].

Sparse Autoencoders

Another way of limiting the representational capacity of the autoencoder to learn useful features is by sparsity regularization. As opposed to undercomplete autoencoders, where the code size is smaller than the input dimensionality, sparse autoencoders are typically overcomplete, but impose an additional sparsity constraint on the learnt representation of the data. There are many approaches to imposing sparsity on the latent code. One strategy is to add an additional term to the loss function during training to penalize the KL divergence between the hidden unit marginals and a desired sparsity rate [15]. More precisely, let us define the average activation of hidden unit j over the data distribution as

\hat{\rho}_j = \frac{1}{n} \sum_{i=1}^{n} z_j(x_i) \qquad (1.2)

where z_j is the activation of hidden unit j, and n is the number of data points in the training data. The cost function of the sparse autoencoder is

L(x, g(f(x))) + \lambda \sum_{j=1}^{m} \mathrm{KL}(\hat{\rho}_j \| \rho) \qquad (1.3)

where L(x, g(f(x))) is defined in Equation 1.1, m is the number of hidden units and ρ is the target sparsity rate. This sparsity is often referred to as lifetime sparsity, because it ensures sparse activations across the training data for any given hidden unit.

A different approach to learning sparse representations in autoencoders is to add a regularization term that penalizes the ℓ1 norm of the latent code for any given input x. In this scheme, the final objective is

L(x, g(f(x))) + \lambda \|z\|_1 \qquad (1.4)

This sparsity is often referred to as population sparsity, because it ensures sparse activations across the hidden units for any given input x. We will further discuss sparse autoencoders in the context of sparse coding in Section 1.2.2.
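As a sketch of how these two penalties can be computed in practice, the snippet below evaluates the lifetime (KL) penalty of Equation 1.3 and the population (ℓ1) penalty of Equation 1.4 for a batch of hidden activations. The assumption that the activations lie in [0, 1] (e.g., sigmoid units) and the target rate ρ = 0.05 are illustrative.

```python
import torch

def lifetime_sparsity_penalty(z, rho=0.05, eps=1e-8):
    # z: (batch, m) activations in [0, 1]. Average each hidden unit over the batch
    # to estimate rho_hat_j (Equation 1.2), then sum KL(rho_hat_j || rho) over units.
    rho_hat = z.mean(dim=0).clamp(eps, 1 - eps)
    kl = rho_hat * torch.log(rho_hat / rho) + (1 - rho_hat) * torch.log((1 - rho_hat) / (1 - rho))
    return kl.sum()

def population_sparsity_penalty(z):
    # L1 norm of the code for each input, averaged over the batch (Equation 1.4).
    return z.abs().sum(dim=1).mean()
```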

Denoising Autoencoders

Another way to regularize the autoencoders is by introducing stochasticity during training. As opposed to the standard autoencoder where the network is trained to reconstruct the original input, the denoising autoencoder [16, 17] is trained to reconstruct the input from its corrupted version. The objective of the denoising autoencoder is

L(x, g(f(\tilde{x}))) \qquad (1.5)

where \tilde{x} is a copy of x that has been corrupted with some form of noise. In the process of learning to undo the input corruption, the denoising autoencoder is encouraged to learn higher-level representations that are stable under this corruption process. These robust features have been shown to be useful for downstream tasks such as classification. It has also been shown that denoising autoencoders can be used as generative models to capture the data distribution [18, 19].

Denoising autoencoders are an example of using stochasticity to regularize neural networks. Another type of autoencoder that uses stochasticity as the regularization technique is the dropout autoencoder [5]. In these autoencoders, in addition to adding noise to the input of the autoencoder, dropout noise [5] is added to each layer of the autoencoder to achieve further regularization. Dropout autoencoders often outperform denoising autoencoders on classification tasks.
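A minimal sketch of the denoising objective of Equation 1.5: corrupt the input with masking noise and reconstruct the clean input. The 20% masking rate is an illustrative assumption, and `model` is assumed to be an autoencoder returning (x̂, z) as in the earlier sketch.

```python
import torch

def denoising_loss(model, x, corruption_rate=0.2):
    # Masking noise: randomly zero a fraction of the input dimensions, then ask the
    # autoencoder to reconstruct the uncorrupted x (Equation 1.5).
    mask = (torch.rand_like(x) > corruption_rate).float()
    x_tilde = x * mask
    x_hat, _ = model(x_tilde)
    return ((x_hat - x) ** 2).sum(dim=1).mean()
```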

Contractive Autoencoders

Another strategy to learn useful representations in autoencoders is to explicitly encourage learning features that are robust to slight variations of input values. While denoising autoencoders implicitly encourage the robustness of features to input variations, contractive autoencoders [20] achieve this robustness explicitly by penalizing the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input using an additional term in the cost function. The final objective function of the contractive autoencoder is

L(x, g(f(x))) + \lambda \sum_{i=1}^{m} \|\nabla_x z_i\|^2 \qquad (1.6)

where z_i is the activation of hidden unit i, and m is the number of hidden units.
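The penalty term of Equation 1.6 can be computed with automatic differentiation. The sketch below does so for a single 1-D input vector by looping over the code dimensions, which is simple but not the most efficient implementation; `encoder` is assumed to be any differentiable module mapping x to z.

```python
import torch

def contractive_penalty(encoder, x):
    # Squared Frobenius norm of the Jacobian dz/dx for a single 1-D input x
    # (the penalty term in Equation 1.6), computed one code dimension at a time.
    x = x.clone().requires_grad_(True)
    z = encoder(x)
    penalty = x.new_zeros(())
    for i in range(z.shape[0]):
        grad_i, = torch.autograd.grad(z[i], x, retain_graph=True, create_graph=True)
        penalty = penalty + grad_i.pow(2).sum()
    return penalty
```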

Applications of Autoencoders

Autoencoders have many applications including generative modeling, semi-supervised learning, data compression, dimensionality reduction, clustering and information retrieval. In this section, we briefly review some of these applications.

Generative Modeling. One of the applications of autoencoders is generative modeling. Recently, it has been shown that denoising autoencoders (Section 1.2.1) and variational autoencoders (Section 1.2.3) are efficient methods for capturing the data distribution. Therefore, these autoencoders can be used in applications such as image denoising, inpainting, super-resolution, exploration in reinforcement learning, and neural network pretraining.

Semi-supervised Learning. An advantage of unsupervised representation learning algorithms such as autoencoders is the ability to use them in semi-supervised scenarios where the amount of labeled data is limited. Once the autoencoder is trained, we can use its learnt unsupervised features for classification tasks by training a linear classifier (e.g., an SVM) on top of them using the labeled data. Indeed, semi-supervised classification is one of the main approaches to evaluating the quality of unsupervised features learnt by regularized autoencoders such as sparse autoencoders or denoising autoencoders. Another approach to using autoencoders for semi-supervised classification is to train generative autoencoders with discrete latent variables, such as semi-supervised variational autoencoders [21]. These generative autoencoders can disentangle the label information from other underlying factors of variation in an unsupervised fashion, which enables the posterior over the discrete latent variables to be used for semi-supervised learning.

Data Compression. Another application of autoencoders is data compression. Once an undercomplete autoencoder (Section 1.2.1) is trained, its hidden code can be used to find low-dimensional representations of the data points. One of the challenges of using autoencoders for data compression is the inherent non-differentiability of the compression loss. However, there has recently been promising progress in this direction. For example, [22] proposes an autoencoder that uses a smooth approximation to the non-differentiable quantization loss, and [23, 24] propose to use recurrent neural networks for compression. These methods achieve performance competitive with current state-of-the-art algorithms in lossy image compression such as JPEG2000.

Dimensionality Reduction and Clustering. Another application of autoencoders is to map the data onto a low-dimensional space in which the data is more interpretable. For example, [14] successfully trained a deep autoencoder with a 30-dimensional code space on documents by pre-training the weights of the autoencoder with the weights of a stacked RBM. It was shown that the autoencoder achieves better reconstruction error than PCA, and also clusters the data into more meaningful and interpretable categories.

Information Retrieval. Autoencoders can also be used in information retrieval applications to find an entry in a dataset that is closest to a query. For example, the idea of semantic hashing is proposed in [25], where a deep autoencoder with a low-dimensional and binary hidden code is trained on a set of documents. The autoencoder is pre-trained with the weights of a stacked RBM, and during the fine-tuning stage, Gaussian noise on the hidden code is used to force the network to learn a binary representation. The use of the binary code makes it possible to store the entire dataset in a hash table. For any given query, we can perform information retrieval by finding the entries of the dataset that are mapped to the same binary code as the query, or find similar entries simply by changing a few bits of the binary code. Similar semantic hashing approaches have been proposed for vision applications in [26, 27].

1.2.2 Sparse Models

One of the promising approaches to learning unsupervised representations is to exploit sparsity as a generic prior on the representations. Learning sparse representations can be motivated from a biological perspective. A major result in neural coding from Olshausen and Field [28] is that receptive fields learnt on natural images using an unsupervised learning algorithm with a sparsity prior have properties similar to primary visual cortex (V1) receptive fields. One of the advantages of learning sparse models is that they are by nature more easily compressed. When working with sparse representations, by storing only their non-zero values and the corresponding positions, we can significantly reduce the storage space, which also makes computations faster.

Sparse feature learning algorithms range from sparse coding approaches [28] to training neural networks with sparsity penalties [29, 15]. These methods typically comprise two steps: a learning algorithm that produces a dictionary W that sparsely represents the data \{x_i\}_{i=1}^N, and an encoding algorithm that, given the dictionary, defines a mapping from a new input vector x to a feature vector. A practical problem with sparse coding is that both the dictionary learning and the sparse encoding steps are computationally expensive. Dictionaries are usually learnt offline by iteratively recovering sparse codes and updating the dictionary. Sparse codes are computed using the current dictionary W and a pursuit algorithm to solve

\hat{z}_i = \underset{z}{\mathrm{argmin}} \; \|x_i - Wz\|_2^2 \quad \text{s.t.} \quad \|z\|_0 < k \qquad (1.7)

where z_i, i = 1, \dots, N are the learnt representations. Convex relaxation methods such as ℓ1 minimization or greedy methods such as OMP [30] are used to solve the above optimization. While greedy algorithms are faster, they are still slow in practice. The current sparse codes are then used to update the dictionary, using techniques such as the method of optimal directions (MOD) [31] or K-SVD [32]. These methods are computationally expensive; MOD requires inverting the data matrix at each step, and K-SVD needs to compute an SVD in order to update every column of the dictionary. To achieve speedups, in [33, 34] a parameterized non-linear encoder function is trained to explicitly predict sparse codes using a soft-thresholding operator. However, these methods assume that the dictionary is already given and do not address the offline phase.

Another approach that has been taken recently is to train autoencoders in a way that encourages sparsity. However, these methods usually involve combinations of activation functions, sampling steps and different kinds of penalties, and are sometimes not guaranteed to produce sparse representations for each input. For example, in [15, 29], a lifetime sparsity penalty function proportional to the negative of the KL divergence between the hidden unit marginals and the target sparsity probability is added to the cost function. This results in sparse activation of hidden units across training points. Another approach to achieving sparse representations is to add a population sparsity cost that penalizes the ℓ1 norm of the hidden representation. As opposed to lifetime sparsity, population sparsity guarantees sparse activations of the hidden units for each input.
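The following toy sketch illustrates the two-step structure of sparse coding on a data matrix X: a crude one-step thresholding pursuit stands in for the pursuit algorithms cited above, and a MOD-style least-squares update stands in for the dictionary update. It is meant only to show the alternation, not to reproduce OMP or K-SVD.

```python
import numpy as np

def hard_threshold(Z, k):
    # Keep the k largest-magnitude coefficients in each column of Z, zero the rest.
    out = np.zeros_like(Z)
    idx = np.argsort(-np.abs(Z), axis=0)[:k]
    np.put_along_axis(out, idx, np.take_along_axis(Z, idx, axis=0), axis=0)
    return out

def learn_dictionary(X, n_atoms, k, n_iters=50):
    # X: (d, N) data matrix. Alternate a crude one-step thresholding pursuit with a
    # MOD-style least-squares dictionary update, then renormalize the atoms.
    d, _ = X.shape
    rng = np.random.default_rng(0)
    W = rng.standard_normal((d, n_atoms))
    W /= np.linalg.norm(W, axis=0, keepdims=True)
    for _ in range(n_iters):
        Z = hard_threshold(W.T @ X, k)      # sparse coding step (cf. Equation 1.7)
        W = X @ np.linalg.pinv(Z)           # MOD update: argmin_W ||X - WZ||^2
        W /= np.linalg.norm(W, axis=0, keepdims=True) + 1e-12
    return W
```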

1.2.3 Generative Models

Building scalable generative models to capture rich distributions such as audio, images or video is one of the central challenges of machine learning. One class of generative model is the probabilistic latent variable model (LVM), which models the joint distribution p(x, z), where x is the observation and z is the associated latent variable. Latent variable models are among the most promising approaches to unsupervised representation learning. The intuition is that by training a latent variable model whose number of parameters is significantly smaller than the amount of data, the latent code of the model is forced to disentangle the underlying factors of variation and learn useful representations of the data.

Until recently, deep generative models such as Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs) and Deep Boltzmann Machines (DBMs) were trained primarily by MCMC-based algorithms [35, 36]. In these approaches, MCMC methods are used to compute the gradient of the log-likelihood, which becomes more imprecise as training progresses. This is because samples from the Markov chains are unable to mix between modes fast enough. In recent years, generative models have been developed that can be trained via direct back-propagation and avoid the difficulties that come with MCMC training. For example, variational autoencoders (VAE) [10, 37] and importance weighted autoencoders [38] use a recognition network to predict the posterior distribution over the latent variables, while generative adversarial networks (GAN) [11] use an adversarial training procedure to directly shape the output distribution of the network via back-propagation. It has recently been shown in [39] that VAEs and GANs can be viewed as extensions of the wake-sleep algorithm [40, 41], which was originally proposed to train sigmoid belief networks. In this section, we briefly review these recently proposed generative models.

Generative Adversarial Networks

The Generative Adversarial Networks (GAN) [11] framework establishes a min-max adversarial game between two neural networks: a generative model, G, and a discriminative model, D. The discriminator, D(x), is a neural network that computes the probability that a point x in data space is a sample from the data distribution (a positive sample) that we are trying to model, rather than a sample from our generative model (a negative sample). Concurrently, the generator uses a function G(z) that maps samples z from the prior p(z) to the data space. G(z) is trained to maximally confuse the discriminator into believing that the samples it generates come from the data distribution. The generator is trained by leveraging the gradient of D(x) w.r.t. x, and using it to modify the generator's parameters. The solution to this game can be expressed as follows [11]:

\min_G \max_D \; \mathbb{E}_{x \sim p_d}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))] \qquad (1.8)

The generator G and the discriminator D can be found using alternating SGD in two stages: (a) train the discriminator to distinguish the true samples from the fake samples generated by the generator; (b) train the generator so as to fool the discriminator with its generated samples.

GANs can be considered within the wider framework of implicit generative models [42, 43, 44]. Implicit distributions can be sampled through their generative path, but their likelihood function is not tractable. Recently, several papers have proposed another application of GAN-style algorithms: approximate inference [42, 43, 44, 45, 46, 47, 48, 49]. These algorithms use implicit distributions to learn posterior approximations that are more expressive than the distributions with tractable densities often used in variational inference. For example, adversarial autoencoders [46] use a universal approximator posterior as the implicit posterior distribution and use adversarial training to match the aggregated posterior of the latent code to the prior distribution. Adversarial variational Bayes [43, 47] uses a more general amortized GAN inference framework within a maximum-likelihood learning setting. Another type of GAN inference technique is used in the ALI [48] and BiGAN [49] models, which have been shown to approximate maximum likelihood learning [43]. In these models, both the recognition and generative models are implicit and are jointly learnt by an adversarial training process.
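The alternating two-stage procedure described above can be sketched as follows; the binary cross-entropy losses, the assumption that D outputs a single logit per example, and the optimizer handling are illustrative choices rather than the only way to train a GAN.

```python
import torch
import torch.nn as nn

def gan_step(G, D, x_real, z_dim, opt_g, opt_d):
    # One round of the alternating two-stage procedure. D is assumed to output an
    # unnormalized logit per example; G maps a z_dim-dimensional prior sample to
    # data space.
    bce = nn.BCEWithLogitsLoss()
    n = x_real.shape[0]
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # (a) Discriminator step: real samples are labelled 1, generated samples 0.
    z = torch.randn(n, z_dim)
    x_fake = G(z).detach()                 # do not backprop into G here
    d_loss = bce(D(x_real), ones) + bce(D(x_fake), zeros)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # (b) Generator step: try to make D assign the "real" label to G's samples.
    z = torch.randn(n, z_dim)
    g_loss = bce(D(G(z)), ones)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```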

Generative Moment Matching Networks

Generative moment matching networks (GMMN) [50] are deep generative models that use the maximum mean discrepancy (MMD) objective on the mini-batches of the data to shape the distribution of the output layer of a neural network. The MMD objective can be interpreted as minimizing the distance between all moments of the model distribution and the data distribution. It has been shown that GMMNs can be combined with pre-trained dropout autoencoders to achieve better likelihood results.

Variational Autoencoders

Variational autoencoders (VAE) [10, 37] are probabilistic autoencoders that use a stochastic encoder network to model the posterior distribution q(z|x), and pair it with a top-down generative network that models the conditional log-likelihood log p(x|z). Both networks are jointly trained to maximize the following variational lower bound on the data log-likelihood:

\log p(x) \geq \mathbb{E}_{q(z|x)}[\log p(x|z)] - \mathrm{KL}(q(z|x) \| p(z)) \qquad (1.9)

where KL is the Kullback–Leibler divergence between two distributions. The objective function of Equation 1.9 can be viewed as adding the regularization term KL(q(z|x)‖p(z)) to the standard reconstruction term of the autoencoder. The prior p(z) on the latent variable is usually set to be the isotropic multivariate Gaussian p(z) = N(0, I), and the posterior distribution q(z|x) is usually set to be a Gaussian distribution whose mean and variance are predicted by the encoder network. However, [51, 52] have recently proposed efficient techniques for learning more expressive posterior distributions in the VAE framework.
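The following is a sketch of the negative of the bound in Equation 1.9 for a Gaussian posterior and an N(0, I) prior, using the reparameterization trick. The assumption that `encoder` returns (mu, logvar) and that the decoder outputs Bernoulli logits (suited to binarized images) is illustrative.

```python
import torch
import torch.nn.functional as F

def vae_loss(encoder, decoder, x):
    # encoder(x) is assumed to return the mean and log-variance of the Gaussian
    # posterior q(z|x); sample z with the reparameterization trick.
    mu, logvar = encoder(x)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    logits = decoder(z)

    # Negative of the bound in Equation 1.9: a Bernoulli reconstruction term plus
    # the closed-form KL between the Gaussian posterior and the N(0, I) prior.
    rec = F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(dim=1)
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)
    return (rec + kl).mean()
```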

Autoregressive Models

A different framework for learning image statistics is autoregressive neural networks, such as Neural Autoregressive Distribution Estimation (NADE) [53], the Masked Autoencoder for Distribution Estimation (MADE) [54], PixelRNN [55] and PixelCNN [56]. In these autoregressive neural estimators, the joint distribution of pixels is cast as the product of conditional distributions, and a sequence model is used to predict each pixel given all the previously generated pixels. One of the earliest autoregressive neural networks is the fully visible belief network (FVBN) [41], which uses log-linear logistic regressors to model the conditional densities. The NADE model [53] is similar to the FVBN, but uses a deep neural network rather than shallow logistic regressors for modeling the conditional distributions of pixels. MADE [54] models the conditional probabilities using an autoencoder with masked weights, where masking is done in a way that respects the autoregressive decomposition of the joint distribution of the pixels. PixelRNN [55] uses a recurrent neural network as the sequence model to predict each pixel given all the previously generated pixels. The recently proposed PixelCNN [56] models the conditional densities using a masked convolutional autoencoder, in which the convolutional filters are masked in a way that respects the autoregressive decomposition of the joint probability of the pixels.

Autoregressive neural estimators capture image statistics directly at the pixel level without learning a hierarchical latent representation. So, unlike latent variable models, these architectures do not have an intractable inference step; however, the lack of latent variables makes it less straightforward to use them in downstream tasks such as classification. Another drawback of autoregressive architectures is that while they are good at capturing low-level pixel statistics and local structure, their generated samples often lack global structure. This is mainly because currently available sequence modeling techniques are not very effective at capturing long-range dependencies in sequential data. Scaling autoregressive generative models to large-scale images is an active area of research.
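As an illustration of the masking idea used by MADE and PixelCNN, the sketch below builds a binary mask for a 2-D convolution filter that respects a raster-scan autoregressive ordering. The type-A/type-B convention and the kernel handling are simplified assumptions, not the full PixelCNN implementation.

```python
import numpy as np

def causal_conv_mask(kernel_size, mask_type="A"):
    # Binary mask for a square 2-D convolution filter under a raster-scan ordering:
    # a pixel may depend on pixels in the rows above it and on pixels to its left
    # in the same row. A type "A" mask also hides the centre pixel (first layer);
    # a type "B" mask keeps it (subsequent layers).
    k = kernel_size
    mask = np.ones((k, k), dtype=np.float32)
    centre = k // 2
    start = centre + (1 if mask_type == "B" else 0)
    mask[centre, start:] = 0.0   # same row: hide the centre (type A) and everything to its right
    mask[centre + 1:, :] = 0.0   # hide all rows below
    return mask

# Example: multiply a convolution kernel elementwise by causal_conv_mask(5, "A")
# before applying it, so the convolution respects the autoregressive decomposition.
```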

1.3 Contributions and Outline

In this thesis, we consider the problem of unsupervised representation learning with autoencoders, and propose different regularization techniques that enable autoencoders to learn useful representations of data in an unsupervised fashion. First, we exploit sparsity as a generic prior on the representations of the autoencoder and propose k-sparse autoencoders in Chapter 2 and winner-take-all autoencoders in Chapter 3. These sparse autoencoders can learn sparse representations of the data using fast inference networks, making them well-suited to large problem sizes where conventional sparse coding algorithms cannot be applied. Next, we study autoencoders from a probabilistic perspective and propose new regularization techniques based on using a generative adversarial network (GAN) that imposes a prior distribution on the latent code of the autoencoder. In particular, we propose two generative autoencoders: adversarial autoencoders in Chapter 4 and PixelGAN autoencoders in Chapter 5. We show that these generative autoencoders can learn expressive posterior approximations, which results in learning useful unsupervised representations of the data. Here, we outline the contributions of this thesis in more detail.

In Chapter 2, we propose the k-sparse autoencoder, which is an autoencoder with a linear activation function, where in the hidden layers only the k highest activities are kept. When applied to the MNIST and NORB datasets, we find that this method achieves better classification results than denoising autoencoders, networks trained with dropout, and RBMs. k-sparse autoencoders are simple to train and the encoding stage is very fast, making them well-suited to large problem sizes where conventional sparse coding algorithms cannot be applied.

In Chapter 3, we propose a winner-take-all method for learning hierarchical sparse representations in an unsupervised fashion. We first introduce fully-connected winner-take-all autoencoders, which use mini-batch statistics to directly enforce a lifetime sparsity in the activations of the hidden units. We then propose the convolutional winner-take-all autoencoder, which combines the benefits of convolutional architectures and autoencoders for learning shift-invariant sparse representations. We describe a way to train convolutional autoencoders layer by layer, where, in addition to lifetime sparsity, a spatial sparsity within each feature map is achieved using winner-take-all activation functions. We show that winner-take-all autoencoders can be used to learn deep sparse representations from the MNIST, CIFAR-10, ImageNet, Street View House Numbers and Toronto Face datasets, and achieve competitive classification performance.

In Chapter 4, we propose the adversarial autoencoder (AAE), which is a probabilistic autoencoder that uses the recently proposed generative adversarial networks (GAN) to perform variational inference by matching the aggregated posterior of the hidden code vector of the autoencoder with an arbitrary prior distribution. Matching the aggregated posterior to the prior ensures that generating from any part of prior space results in meaningful samples. As a result, the decoder of the adversarial autoencoder learns a deep generative model that maps the imposed prior to the data distribution. We show how the adversarial autoencoder can be used in applications such as semi-supervised classification, disentangling style and content of images, unsupervised clustering, dimensionality reduction and data visualization. We perform experiments on the MNIST, Street View House Numbers and Toronto Face datasets and show that adversarial autoencoders achieve competitive results in generative modeling and semi-supervised classification tasks.

In Chapter 5, we describe the PixelGAN autoencoder, a generative autoencoder in which the generative path is a convolutional autoregressive neural network on pixels (PixelCNN) that is conditioned on a latent code, and the recognition path uses a generative adversarial network (GAN) to impose a prior distribution on the latent code. We show that different priors result in different decompositions of information between the latent code and the autoregressive decoder. For example, by imposing a Gaussian distribution as the prior, we can achieve a global vs. local decomposition, or by imposing a categorical distribution as the prior, we can disentangle the style and content information of images in an unsupervised fashion. We further show how the PixelGAN autoencoder with a categorical prior can be directly used in semi-supervised settings and achieve competitive semi-supervised classification results on the MNIST, SVHN and NORB datasets. Finally, in Chapter 6, we end this work with concluding remarks and future directions.

Relationship to Published Papers

The chapters in this thesis describe work that has been published in the following papers:

Chapter 2: Alireza Makhzani and Brendan Frey. "k-sparse autoencoders", International Conference on Learning Representations (ICLR), 2014 [57].

Chapter 3: Alireza Makhzani and Brendan Frey. "Winner-take-all autoencoders", Advances in Neural Information Processing Systems (NIPS), 2015 [58].

Chapter 4: Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. "Adversarial autoencoders", International Conference on Learning Representations (ICLR) Workshop, 2016 [46].

Chapter 5: Alireza Makhzani and Brendan Frey. "PixelGAN autoencoders", Advances in Neural Information Processing Systems (NIPS), 2017 [59].

Chapter 2

k-Sparse Autoencoders

2.1 Introduction

Recently, it has been observed that when representations are learnt in a way that encourages sparsity, improved performance is obtained on classification tasks. These methods involve combinations of activation functions, sampling steps and different kinds of penalties. To investigate the effectiveness of sparsity by itself, we propose the "k-sparse autoencoder", which is an autoencoder with a linear activation function, where in the hidden layers only the k highest activities are kept. We explore how different sparsity levels (k) impact representations and classification performance. We show that by solely relying on sparsity as the regularizer and as the only nonlinearity, we can achieve much better results than the other methods, including RBMs, denoising autoencoders [17] and dropout [5]. We demonstrate that k-sparse autoencoders are suitable for pretraining and achieve results comparable to the state of the art on the MNIST and NORB datasets.

In this chapter, Γ is an estimated support set and Γ^c is its complement, W^† is the pseudo-inverse of W, and supp_k(x) is an operator that returns the indices of the k largest coefficients of its input vector. z_Γ is the vector obtained by restricting the elements of z to the indices of Γ, and W_Γ is the matrix obtained by restricting the columns of W to the indices of Γ.

2.2 k-Sparse Autoencoders

2.2.1 The Basic Autoencoder

A shallow autoencoder maps an input vector x to a hidden representation using the function z = f(Px + b), parameterized by {P, b}, where f is the activation function, e.g., linear, sigmoidal or ReLU. The hidden representation is then mapped linearly to the output using x̂ = Wz + b'. The parameters are optimized to minimize the mean squared error \|\hat{x} - x\|_2^2 over all training points. Often, tied weights are used, so that P = W^⊤.

k-Sparse Autoencoders

Training:
1) Perform the feedforward phase and compute z = W^⊤x + b.
2) Find the k largest activations of z and set the rest to zero: z_{Γ^c} = 0, where Γ = supp_k(z).
3) Compute the output and the error using the sparsified z: x̂ = Wz + b', E = \|x - \hat{x}\|_2^2.
4) Backpropagate the error through the k largest activations defined by Γ and iterate.

Sparse Encoding:
1) Compute the features z = W^⊤x + b.
2) Find the αk largest activations of z and set the rest to zero: z_{Γ^c} = 0, where Γ = supp_{αk}(z).

2.2.2 The k-Sparse Autoencoder

The k-sparse autoencoder is based on an autoencoder with linear activation functions and tied weights. In the feedforward phase, after computing the hidden code z = W^⊤x + b, rather than reconstructing the input from all of the hidden units, we identify the k largest hidden units and set the others to zero. This can be done by sorting the activities or by using ReLU hidden units with thresholds that are adaptively adjusted until the k largest activities are identified. This results in a vector of activities with the support set supp_k(W^⊤x + b). Note that once the k largest activities are selected, the function computed by the network is linear. So the only non-linearity comes from the selection of the k largest activities. This selection step acts as a regularizer that prevents the use of an overly large number of hidden units when reconstructing the input.

Once the weights are trained, the resulting sparse representations may be used for downstream classification tasks. However, it has been observed that better results are often obtained when the sparse encoding stage used for classification does not exactly match the encoding used for dictionary training [60]. For example, while in k-means it is natural to have a hard assignment of the points to the nearest cluster in the encoding stage, it has been shown in [61] that soft assignments result in better classification performance. Similarly, for the k-sparse autoencoder, instead of using the k largest elements of W^⊤x + b as the features, we have observed that slightly better performance is obtained by using the αk largest hidden units, where α ≥ 1 is selected using validation data. So at test time, we use the support set defined by supp_{αk}(W^⊤x + b). The algorithm is summarized in the box above. A related method to the k-sparse autoencoder is the Cardinality RBM [62], which achieves exact sparsity in RBMs using a dynamic programming algorithm.
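The training procedure in the box of Section 2.2.1 can be sketched as follows in PyTorch; torch.topk performs the support selection, the tied-weight parameterization (P = W^⊤) follows Section 2.2.1, and the layer sizes and initialization scale are illustrative assumptions.

```python
import torch
import torch.nn as nn

class KSparseAutoencoder(nn.Module):
    def __init__(self, input_dim=784, n_hidden=1000, k=25):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(input_dim, n_hidden))  # tied weights
        self.b = nn.Parameter(torch.zeros(n_hidden))        # encoder bias
        self.b_out = nn.Parameter(torch.zeros(input_dim))   # decoder bias
        self.k = k

    def forward(self, x, k=None):
        k = k if k is not None else self.k       # pass alpha*k at test time if desired
        z = x @ self.W + self.b                  # feedforward phase: z = W^T x + b
        topk = torch.topk(z, k, dim=1)           # support selection (step 2)
        z_sparse = torch.zeros_like(z).scatter_(1, topk.indices, topk.values)
        x_hat = z_sparse @ self.W.t() + self.b_out   # reconstruct from the sparsified code
        return x_hat, z_sparse

# Training step: loss = ((x_hat - x) ** 2).sum(dim=1).mean(); gradients only flow
# through the k selected units, because the others were zeroed before decoding.
```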

2.3 Analysis of k-Sparse Autoencoders

In this section, we explain how the k-sparse autoencoder can be viewed in the context of sparse coding with incoherent matrices. This perspective sheds light on why k-sparse autoencoders work and why they achieve invariant features and consequently good classification results. We first explain a sparse recovery algorithm and then show that the k-sparse autoencoder iterates between an approximation of this algorithm and a dictionary update stage.

2.3.1 Iterative Thresholding with Inversion (ITI)

Iterative hard thresholding [63] is a class of low-complexity algorithms that has recently been proposed for the reconstruction of sparse signals. In this work, we use a variant called "iterative thresholding with inversion" (ITI) [64]. Given a fixed x and W, starting from z^0 = 0, ITI iteratively finds the sparsest solution of x = Wz using the following steps:

1. Support Estimation Step:

\Gamma = \mathrm{supp}_k\big(z^n + W^\top(x - Wz^n)\big) \qquad (2.1)

2. Inversion Step:

z^{n+1}_\Gamma = W^\dagger_\Gamma x = (W^\top_\Gamma W_\Gamma)^{-1} W^\top_\Gamma x, \qquad z^{n+1}_{\Gamma^c} = 0 \qquad (2.2)

Assume H = W^⊤W − I and z_0 is the true sparse solution. The first step of ITI estimates the support set as Γ = supp_k(W^⊤x) = supp_k(z_0 + Hz_0). If W were orthogonal, we would have Hz_0 = 0 and the algorithm would succeed in the first iteration. But if W is overcomplete, Hz_0 behaves as a noise vector whose variance decreases after each iteration. After estimating the support set of z as Γ, we restrict W to the indices included in Γ and form W_Γ. We then use the pseudo-inverse of W_Γ to estimate the non-zero values by minimizing \|x - W_\Gamma z_\Gamma\|_2^2. Lastly, we refine the support estimate and repeat the whole process until convergence.
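A sketch of the ITI iteration in NumPy: estimate the support, then solve the least-squares problem restricted to that support via the pseudo-inverse. The fixed iteration count is an illustrative assumption.

```python
import numpy as np

def iti(x, W, k, n_iters=10):
    # Iterative thresholding with inversion: find a k-sparse z such that x ≈ W z.
    d, m = W.shape
    z = np.zeros(m)
    for _ in range(n_iters):
        # Support estimation step (Equation 2.1).
        r = z + W.T @ (x - W @ z)
        support = np.argsort(-np.abs(r))[:k]
        # Inversion step (Equation 2.2): least squares restricted to the support.
        z = np.zeros(m)
        z[support] = np.linalg.pinv(W[:, support]) @ x
    return z
```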

2.3.2 Sparse Coding with the k-Sparse Autoencoder

Here, we show that we can derive the k-sparse autoencoder training algorithm by approximating a sparse coding algorithm that uses the ITI algorithm jointly with a dictionary update stage. The conventional approach to sparse coding is to fix the sparse code matrix Z while updating the dictionary. However, here, after estimating the support set in the first step of the ITI algorithm, we jointly perform the inversion step of ITI and the dictionary update step, while fixing just the support set of the sparse code Z. In other words, we update the atoms of the dictionary and allow the corresponding non-zero values to change at the same time, so as to minimize \|X - W_\Gamma Z_\Gamma\|_2^2 over both W_Γ and Z_Γ.

When we are performing sparse recovery with the ITI algorithm using a fixed dictionary, we should perform a fixed number of iterations to get a perfect reconstruction of the signal. But in sparse coding, since we have learnt a dictionary that is adapted to the signals, as shown in Section 2.3.3, we can find the support set with just the first iteration of ITI:

\Gamma = \mathrm{supp}_k(W^\top x) \qquad (2.3)

In the inversion step of the ITI algorithm, once we have estimated the support set, we use the pseudo-inverse of W_Γ to find the non-zero values on the support set. The pseudo-inverse of the matrix W_Γ is a matrix P_Γ that minimizes the following cost function:

W^\dagger_\Gamma = \underset{P_\Gamma}{\arg\min} \; \|x - W_\Gamma z_\Gamma\|_2^2 = \underset{P_\Gamma}{\arg\min} \; \|x - W_\Gamma P_\Gamma x\|_2^2 \qquad (2.4)

Finding the exact pseudo-inverse of W_Γ is computationally expensive, so instead we perform a single step of gradient descent. The gradient with respect to P_Γ is found as follows:

\frac{\partial \|x - W_\Gamma z_\Gamma\|_2^2}{\partial P_\Gamma} = \frac{\partial \|x - W_\Gamma z_\Gamma\|_2^2}{\partial z_\Gamma} \, x^\top \qquad (2.5)

The first term on the right hand side of Equation 2.5 corresponds to the dictionary update stage, which is computed as follows:

\frac{\partial \|x - W_\Gamma z_\Gamma\|_2^2}{\partial W_\Gamma} = (W_\Gamma z_\Gamma - x) \, z_\Gamma^\top \qquad (2.6)

Therefore, in order to approximate the pseudo-inverse, we first find the dictionary derivative and then "backpropagate" it to find the update of the pseudo-inverse. We can view these operations in the context of an autoencoder with linear activations, where P is the encoder weight matrix and W is the decoder weight matrix. At each iteration, instead of back-propagating through all the hidden units, we just back-propagate through the units with the k largest activities, defined by supp_k(W^⊤x), which is the first iteration of ITI. Keeping the k largest hidden activities and ignoring the others is the same

as forming W_Γ by restricting W to the estimated support set. Back-propagation on the decoder weights is the same as gradient descent on the dictionary, and back-propagation on the encoder weights is the same as approximating the pseudo-inverse of the corresponding W_Γ.

We can perform support estimation in the feedforward phase by assuming P = W^⊤ (i.e., the autoencoder has tied weights). In this case, support estimation can be done by computing z = W^⊤x + b and picking the k largest activations; the bias just accounts for the mean and subtracts its contribution. Then the "inversion" and "dictionary update" steps are done at the same time by back-propagation through just the units with the k largest activities. In summary, we can view k-sparse autoencoders as an approximation of a sparse coding algorithm that uses ITI in the sparse recovery stage.

2.3.3 Importance of Incoherence

The coherence of a dictionary indicates the degree of similarity between different atoms or different collections of atoms. Since the dictionary is overcomplete, we can represent each column of the dictionary as a linear combination of the other columns. What incoherence implies is that we should not be able to represent a column as a sparse linear combination of the other columns: the coefficients of the linear combination should be dense. For example, if two columns are exactly the same, then the dictionary is highly coherent, since we can represent one of those columns as a sparse linear combination of the rest of the columns. A naive measure of coherence that has been proposed in the literature is the mutual coherence µ(W), which is defined as the maximum absolute inner product across all possible pairs of atoms of the dictionary:

\mu(W) = \max_{i \neq j} |\langle w_i, w_j \rangle| \qquad (2.7)

There is a close relationship between the coherence of the dictionary and the uniqueness of the sparse solution of x = Wz. In [65], it has been proven that if k < (1 + µ⁻¹)/2, then the sparsest solution is unique. We can show that if the dictionary is incoherent enough, there is an attraction ball around the signal x, and only one sparse linear combination of the columns can get into this attraction ball. So even if we perturb the input with a small amount of noise, translation, rotation, etc., we can still achieve perfect reconstruction of the original signal, and the sparse features are always roughly conserved. Therefore, incoherence of the dictionary is a measure of the invariance and stability of the features. This is related to the denoising autoencoder [17], in which we achieve invariant features by trying to reconstruct the original signal from its noisy versions. Here we show that if the dictionary is incoherent enough, the first step of the ITI algorithm is sufficient for perfect sparse recovery.

Theorem 3.1. Assume x = Wz and that the columns of the dictionary have unit ℓ2-norm. Also, without loss of generality, assume that the non-zero elements of z are its first k elements and are sorted as z_1 ≥ z_2 ≥ ... ≥ z_k. Then, if kµ ≤ z_k / (2z_1), we can recover the support set of z using supp_k(W^⊤x).

Proof: Let us assume 1 ≤ i ≤ k and y = W^⊤x. Then, we can write:

y_i = z_i + \sum_{j=1, j \neq i}^{k} \langle w_i, w_j \rangle z_j \geq z_i - \mu \sum_{j=1, j \neq i}^{k} z_j \geq z_k - k\mu z_1 \qquad (2.8)

On the other hand:

\max_{i > k} \{y_i\} = \max_{i > k} \left\{ \sum_{j=1}^{k} \langle w_i, w_j \rangle z_j \right\} \leq k\mu z_1 \qquad (2.9)

So if kµ ≤ z_k / (2z_1), all of the first k elements of y are guaranteed to be greater than the rest of its elements.

As we can see from Theorem 3.1, the chance of finding the true support set with the encoder part of the k-sparse autoencoder depends on the incoherence of the learnt dictionary. As the k-sparse autoencoder converges (i.e., the reconstruction error goes to zero), the algorithm learns a dictionary that satisfies x ≈ Wz, so the support set of z can be estimated using the first step of ITI. Since supp_k(W^⊤x) succeeds in finding the support set when the algorithm converges, the learnt dictionary must be sufficiently incoherent.
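For reference, the mutual coherence of Equation 2.7 can be computed directly from the Gram matrix of the column-normalized dictionary; the following is a small sketch.

```python
import numpy as np

def mutual_coherence(W):
    # Maximum absolute inner product between distinct (unit-normalized) columns of W.
    Wn = W / (np.linalg.norm(W, axis=0, keepdims=True) + 1e-12)
    G = np.abs(Wn.T @ Wn)
    np.fill_diagonal(G, 0.0)   # ignore self inner products
    return G.max()
```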

2.4 Experiments

In this section, we evaluate the performance of k-sparse autoencoders in both unsupervised learning and in shallow and deep discriminative learning tasks.

2.4.1 Datasets

We use the MNIST handwritten digit dataset, which consists of 60,000 training images and 10,000 test images. We randomly separate the training set into 50,000 training cases and 10,000 cases for validation.

We also use the small NORB normalized-uniform dataset [66], which contains 24,300 training examples and 24,300 test examples. This database contains images of 50 toys from 5 generic categories: four-legged animals, human figures, airplanes, trucks, and cars. Each image consists of two channels, each of size 96 × 96 pixels. We take the inner 64 × 64 pixels of each channel and resize it using bicubic interpolation to 32 × 32 pixels, from which we form an input vector with 2048 dimensions. To normalize the contrast, the mean is subtracted from each input dimension and each dimension is divided by its standard deviation across the whole training set. The training set is separated into 20,000 cases for training and 4,300 for validation.

We also test our method on natural image patches extracted from the CIFAR-10 dataset. We randomly extract 1,000,000 patches of size 8 × 8 from the 50,000 32 × 32 images of CIFAR-10. Each patch is then locally contrast-normalized and ZCA whitened. This preprocessing pipeline is the same as the one used in [67] for feature extraction.

2.4.2 Training of k-Sparse Autoencoders

Scheduling of the Sparsity Level

When we enforce low sparsity levels in k-sparse autoencoders (e.g., k = 15 on MNIST), one issue that might arise is that in the first few epochs, the algorithm greedily assigns individual hidden units to groups of training cases, in a manner similar to k-means clustering. In subsequent epochs, these hidden units will be picked and reinforced, and the other hidden units will not be adjusted. That is, too much sparsity can prevent gradient back-propagation from adjusting the weights of these other 'dead' hidden units. We can address this problem by scheduling the sparsity level over epochs as follows.

Suppose we are aiming for a sparsity level of k = 15. Then, we start off with a large sparsity level (e.g. k = 100) for which the k-sparse autoencoder can train all the hidden units. We then linearly decrease the sparsity level from k = 100 to k = 15 over the first half of the epochs. This initializes the autoencoder in a good regime, for which all of the hidden units have a significant chance of being picked. Then, we keep k = 15 for the second half of the epochs. With this scheduling, we can train all of the filters, even for low sparsity levels.
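A sketch of this schedule, using the MNIST figures quoted above (start at k = 100, anneal linearly to k = 15 over the first half of training, then hold):

```python
def sparsity_schedule(epoch, n_epochs, k_start=100, k_target=15):
    # Linearly anneal k from k_start to k_target over the first half of training,
    # then hold it fixed; the defaults follow the MNIST example above.
    half = max(n_epochs // 2, 1)
    if epoch >= half:
        return k_target
    return round(k_start + (k_target - k_start) * epoch / half)
```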

Training Hyper-parameters

We optimized the model parameters using stochastic gradient descent with momentum as follows:

v_{k+1} = m_k v_k - \eta_k \nabla f(x_k), \qquad x_{k+1} = x_k + v_{k+1} \qquad (2.10)

Here, v_k is the velocity vector, m_k is the momentum and η_k is the learning rate at the k-th iteration. We also use a Gaussian distribution with standard deviation σ to initialize the weights. We use different momentum values, learning rates and initializations based on the task and the dataset, and validation data is used to select the hyperparameters. In the unsupervised MNIST task, the values were σ = 0.01, m_k = 0.9 and η_k = 0.01, for 5000 epochs. In the supervised MNIST task, training started with m_k = 0.25 and η_k = 1, and then the learning rate was linearly decreased to 0.001 over 200 epochs. In the unsupervised NORB task, the values were σ = 0.01, m_k = 0.9 and η_k = 0.0001, for 5000 epochs. In the supervised NORB task, training started with m_k = 0.9 and η_k = 0.01, and then the learning rate was linearly decreased to 0.001 over 200 epochs.

Implementations

While most conventional sparse coding algorithms require complex matrix operations such as matrix inversion or SVD decomposition, k-sparse autoencoders only need matrix multiplications and sorting operations in both the dictionary learning stage and the sparse encoding stage. (For a parallel, distributed implementation, the sorting operation can be replaced by a method that recursively applies a threshold until k values remain.) We used an efficient GPU implementation obtained using the publicly available gnumpy library [68] on a single Nvidia GTX 680 GPU.

2.4.3 Effect of Sparsity Level

In k-sparse autoencoders, we are able to tune the value of k to obtain the desired sparsity level, which makes the algorithm suitable for a wide variety of datasets. For example, one application could be pre-training a shallow or deep discriminative neural network. For large values of k (e.g., k = 100 on MNIST), the algorithm tends to learn very local features, as shown in Figure 2.1a and Figure 2.2a. These features are too primitive to be used for classification with a shallow architecture, since a naive linear classifier does not have enough capacity to combine them and achieve a good classification rate. However, these features could be used for pre-training deep neural nets.

As we decrease the sparsity level (e.g., k = 40 on MNIST), the output is reconstructed using a smaller number of hidden units and thus the features tend to be more global, as can be seen in Figure 2.1b, Figure 2.1c and Figure 2.2b. For example, on the MNIST dataset, the lengths of the strokes increase when the sparsity level is decreased. These less local features are suitable for classification with a shallow architecture. Nevertheless, forcing too much sparsity (e.g., k = 10 on MNIST) results in features that are too global and do not factor the input into parts, as depicted in Figure 2.1d and Figure 2.2c.

Figure 2.3 shows the filters of the k-sparse autoencoder with 1000 hidden units and sparsity level k = 50, learnt from random image patches extracted from the CIFAR-10 dataset. We can see that the k-sparse autoencoder has learnt localized Gabor filters from natural image patches.

Figure 2.4 plots histograms of the hidden unit activities for various unsupervised learning algorithms, including the k-sparse autoencoder (k = 70 and k = 15), applied to the MNIST data. This figure contrasts the sparsity achieved by the k-sparse autoencoder with that of other algorithms.

2.4.4 Unsupervised Feature Learning Results

In order to compare the quality of the features learnt by our algorithm with those learnt by other unsupervised learning methods, we first extracted features using each unsupervised learning algorithm. Then we fixed the features and trained a logistic regression classifier using those features. The usefulness of the features is then evaluated by examining the error rate of the classifier. We trained a number of architectures on the MNIST and NORB datasets, including RBMs, dropout autoencoders and denoising autoencoders.

Figure 2.1: Filters of the k-sparse autoencoder for different sparsity levels k (70, 40, 25 and 10), learnt from MNIST with 1000 hidden units.

Figure 2.2: Filters of the k-sparse autoencoder for different sparsity levels k (200, 150 and 50), learnt from NORB with 4000 hidden units.

Figure 2.3: Filters of k-sparse autoencoder with 1000 hidden units and k = 50, learnt from CIFAR-10 random patches.

In dropout, after finding the features using dropout regularization with a dropout rate of 50%, we used all of the hidden units as the features (this worked best). For the denoising autoencoder, after training the network by dropping the input pixels at a rate of 20%, we used all of the uncorrupted input pixels to find the features for classification (this worked best). For the k-sparse autoencoder, after training the dictionary, we used the support set supp_{αk}(W^⊤x + b) to find the features, as explained in Section 2.2.2, where α was determined using validation data. Results for the different architectures are compared in Table 2.1 and Table 2.2. We can see that the performance of our k-sparse autoencoder is better than that of the other algorithms. For our algorithm, the best result is achieved by k = 25, α = 3 with 1000 hidden units on the MNIST dataset, and by k = 150, α = 2 with 4000 hidden units on the NORB dataset.

Figure 2.4: Log histogram of hidden unit activities for various unsupervised learning methods: a ReLU autoencoder, a dropout autoencoder (50% hidden and 20% input dropout), and k-sparse autoencoders with k = 70 and k = 15.

Method                                                         Error Rate
Raw Pixels                                                     7.20%
RBM                                                            1.81%
Dropout Autoencoder (50% hidden)                               1.80%
Denoising Autoencoder (20% input dropout)                      1.95%
Dropout + Denoising Autoencoder (20% input and 50% hidden)     1.60%
k-Sparse Autoencoder, k = 40                                   1.54%
k-Sparse Autoencoder, k = 25                                   1.35%
k-Sparse Autoencoder, k = 10                                   2.10%

Table 2.1: Performance of unsupervised learning methods (without fine-tuning) with 1000 hidden units on MNIST.

Method                                                         Error Rate
Raw Pixels                                                     23%
RBM (weight decay)                                             10.6%
Dropout Autoencoder                                            10.1%
Denoising Autoencoder (20% input dropout)                      9.5%
k-Sparse Autoencoder, k = 200                                  10.4%
k-Sparse Autoencoder, k = 150                                  8.6%
k-Sparse Autoencoder, k = 75                                   9.5%

Table 2.2: Performance of unsupervised learning methods (without fine-tuning) with 4000 hidden units on NORB.

2.4.5 Shallow Supervised Learning Results

In supervised learning, it is a common practice to use the encoder weights learnt by an unsupervised learning method to initialize the early layers of a multilayer discriminative model [69]. The back-propagation algorithm is then used to adjust the weights of the last hidden layer and also to fine-tune the weights in the previous layers. This procedure is often referred to as discriminative fine-tuning. In this section, we report results using unsupervised learning algorithms such as RBMs, DBNs [9], DBMs [9], the third-order RBM [29], dropout autoencoders, denoising autoencoders and k-sparse autoencoders to initialize a shallow discriminative neural network for the MNIST and NORB datasets. We used back-propagation to fine-tune the weights. The regularization method used in the fine-tuning stage of each algorithm is the same as the one used in training the corresponding unsupervised model. For instance, we fine-tuned the weights obtained from the dropout autoencoder with dropout regularization, and for the denoising autoencoder we fine-tuned the discriminative neural net by adding noise to the input. In a similar manner, in the fine-tuning stage of the k-sparse autoencoder, we used the αk largest hidden units in

                                                                      Error
Without Pre-Training                                                  1.60%
RBM + Fine Tuning                                                     1.24%
Shallow Dropout AE + Fine Tuning (50% hidden)                         1.05%
Denoising AE + Fine Tuning (20% input dropout)                        1.20%
Deep Dropout AE + Fine Tuning (Layer-wise pre-training, 50% hidden)   0.85%
k-Sparse AE + Fine Tuning (k = 25)                                    1.08%
Deep k-Sparse AE + Fine Tuning (Layer-wise pre-training)              0.97%

Table 2.3: Performance of supervised learning methods on MNIST. Pre-training was performed using the corresponding unsupervised learning algorithm with 1000 hidden units, and then the model was fine-tuned.

                                                                      Error
Without Pre-Training                                                  12.7%
DBN                                                                   8.3%
DBM                                                                   7.2%
Third-Order RBM                                                       6.5%
Shallow Dropout AE + Fine Tuning (50% hidden)                         8.2%
Shallow Denoising AE + Fine Tuning (20% input dropout)                7.9%
Deep Dropout AE + Fine Tuning (Layer-wise pre-training, 50% hidden)   7.0%
Shallow k-Sparse AE + Fine Tuning (k = 150)                           7.8%
Deep k-Sparse AE + Fine Tuning (k = 150, Layer-wise pre-training)     7.4%

Table 2.4: Performance of supervised learning methods on NORB. Pre-training was performed using the corresponding unsupervised learning algorithm with 4000 hidden units, and then the model was fine-tuned.

the corresponding discriminative neural network, as explained in Section 2.2.2. Table 2.3 and Table 2.4 report the error rates obtained by different methods.

2.4.6 Deep Supervised Learning Results

The k-sparse autoencoder can be used as a building block of a deep neural network, using greedy layer-wise pre-training [70]. We first train a shallow k-sparse autoencoder and obtain the hidden codes. We then fix the features and train another k-sparse autoencoder on top of them to obtain another set of hidden codes. Then we use the parameters of these autoencoders to initialize a discriminative neural network with two hidden layers. In the fine-tuning stage of the deep neural net, we first fix the parameters of the first and second layers and train a softmax classifier on top of the second layer. We then hold

the weights of the first layer fixed and train the second layer and the softmax jointly, using the initialization of the softmax that we found in the previous step. Finally, we jointly fine-tune all of the layers with the previous initialization. We have observed that this method of layer-wise fine-tuning can improve the classification performance compared to the case where we fine-tune all the layers at the same time. In all of the fine-tuning steps, we keep the αk largest hidden codes, where k = 25, α = 3 on MNIST and k = 150, α = 2 on NORB, in both hidden layers. Table 2.3 and Table 2.4 report the classification results of different deep supervised learning methods.

2.5 Conclusion

In this chapter, we proposed a very fast sparse coding method called the k-sparse autoencoder, which achieves exact sparsity in the hidden representation. The main message of this chapter is that we can use the resulting representations to achieve competitive classification results, solely by enforcing sparsity in the hidden units and without using any other nonlinearity or regularization. We also discussed how the k-sparse autoencoder could be used for pre-training shallow and deep supervised architectures.

Chapter 3

Winner-Take-All Autoencoders

3.1 Introduction

In this chapter, we exploit sparsity as a generic prior on the representations for unsupervised feature learning, and propose a winner-take-all method for learning hierarchical sparse representations with autoencoders. We first introduce fully-connected winner-take-all autoencoders which use mini-batch statistics to directly enforce a lifetime sparsity in the activations of the hidden units. We then propose the convolutional winner-take-all autoencoder which combines the benefits of convolutional architectures and autoencoders for learning shift-invariant/convolutional sparse representations. We describe a way to train convolutional autoencoders layer by layer, where in addition to lifetime sparsity, a spatial sparsity within each feature map is achieved using winner-take-all activation functions. We will show that winner-take-all autoencoders can be used to learn deep sparse representations from the MNIST, CIFAR-10, ImageNet, Street View House Numbers and Toronto Face datasets, and achieve competitive classification performance.

3.2 Fully-Connected Winner-Take-All Autoencoders

Training sparse autoencoders has been well studied in the literature. For example, in [15], a “lifetime sparsity” penalty function proportional to the KL divergence between the hidden unit marginals ρ̂ and the target sparsity probability ρ is added to the cost function: λ KL(ρ ‖ ρ̂). A major drawback of this approach is that it only works for certain target sparsities and it is often very difficult to find the right λ parameter that results in a properly trained sparse autoencoder. Also, the KL divergence was originally proposed for sigmoidal autoencoders, and it is not clear how it can be applied to ReLU autoencoders

where ρ̂ could be larger than one (in which case the KL divergence cannot be evaluated). In this chapter, we propose Fully-Connected Winner-Take-All (FC-WTA) autoencoders to address these concerns. FC-WTA autoencoders can aim for any target sparsity rate, train very fast (marginally slower than a standard autoencoder), have no hyper-parameter to be tuned (except the target sparsity rate) and efficiently train all the dictionary atoms even when very aggressive sparsity rates (e.g., 1%) are enforced.

Sparse coding algorithms typically comprise two steps: a highly non-linear sparse encoding operation that finds the “right” atoms in the dictionary, and a linear decoding stage that reconstructs the input with the selected atoms and updates the dictionary. The FC-WTA autoencoder is a non-symmetric autoencoder where the encoding stage is typically a stack of several ReLU layers and the decoder is just a linear layer. In the feedforward phase, after computing the hidden codes of the last layer of the encoder, rather than reconstructing the input from all of the hidden units, we impose a lifetime sparsity on each hidden unit by keeping the k percent largest activations of that hidden unit across the mini-batch samples and setting the rest of its activations to zero. In the backpropagation phase, we only backpropagate the error through the k percent non-zero activations. In other words, we are using the mini-batch statistics to approximate the statistics of the activation of a particular hidden unit across all the samples, and finding a hard threshold value for which we can achieve a k% lifetime sparsity rate. In this setting, the highly nonlinear encoder of the network (ReLUs followed by top-k sparsity) learns to do sparse encoding, and the decoder of the network reconstructs the input linearly. At test time, we turn off the sparsity constraint and the output of the deep ReLU network is the final representation of the input. In order to train a stacked FC-WTA autoencoder, we fix the weights and train another FC-WTA autoencoder on top of the fixed representation of the previous network.

The learnt dictionaries of FC-WTA autoencoders trained on the MNIST, CIFAR-10 and Toronto Face datasets are visualized in Figure 3.1 and Figure 3.2. For large sparsity levels, the algorithm tends to learn very local features that are too primitive to be used for classification (Figure 3.1a). As we decrease the sparsity level, the network learns more useful features (longer digit strokes) and achieves better classification (Figure 3.1b and Figure 3.1c). Nevertheless, forcing too much sparsity results in features that are too global and do not factor the input into parts (Figure 3.1d and Figure 3.1e). Section 3.4.1 reports the classification results.
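A minimal NumPy sketch of the lifetime-sparsity step described above is given below: for every hidden unit, only the k% largest activations across the mini-batch are kept and the rest are zeroed. The function name, array shapes and random data are illustrative assumptions, not the implementation used in this chapter.

```python
import numpy as np

def lifetime_sparsity(h, rate):
    """Apply winner-take-all lifetime sparsity.

    h:    (batch_size, n_hidden) ReLU activations of the last encoder layer
    rate: target lifetime sparsity in (0, 1], e.g. 0.05 for 5%
    For each hidden unit (column), keep the ceil(rate * batch_size) largest
    activations across the mini-batch and set the remaining ones to zero.
    Only the surviving activations would receive gradients in backprop.
    """
    batch_size = h.shape[0]
    k = max(1, int(np.ceil(rate * batch_size)))
    # per-unit threshold: the k-th largest activation in that column
    thresh = np.partition(h, batch_size - k, axis=0)[batch_size - k]
    return np.where(h >= thresh, h, 0.0)

# Example: 100-sample mini-batch, 1000 hidden units, 5% lifetime sparsity
rng = np.random.default_rng(0)
h = np.maximum(rng.standard_normal((100, 1000)), 0.0)   # ReLU codes
h_sparse = lifetime_sparsity(h, rate=0.05)
```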

Figure 3.1: Learnt dictionary (decoder) of the FC-WTA autoencoder with 1000 hidden units and different sparsity rates, trained on MNIST: (a) 10% sparsity, (b) 5% sparsity, (c) 3% sparsity, (d) 2% sparsity, (e) 1% sparsity.

(a) Toronto Face Dataset (48 × 48) (b) CIFAR-10 Patches (11 × 11) Figure 3.2: Dictionaries (decoder) of FC-WTA autoencoder with 256 hidden units and sparsity of 5%.

Winner-Take-All RBMs. Besides autoencoders, WTA activations can also be used in Restricted Boltzmann Machines (RBMs) to learn sparse representations. Suppose h and v denote the hidden and visible units of an RBM. For training WTA-RBMs, in the positive phase of the contrastive divergence, instead of sampling from P(h_i|v), we first keep the k% largest P(h_i|v) for each h_i across the mini-batch dimension and set the rest of the P(h_i|v) values to zero, and then sample h_i according to the sparsified P(h_i|v). Filters of a WTA-RBM trained on MNIST are visualized in Figure 3.3. We can see that WTA-RBMs learn longer digit strokes on MNIST, which, as will be shown in Section 3.4.1, improves the classification rate. Note that the sparsity rate of WTA-RBMs (e.g., 30%) should not be as aggressive as that of WTA autoencoders (e.g., 5%), since RBMs are already regularized by having binary hidden states. A related method to the winner-take-all RBM is the Cardinality RBM [62], which achieves exact sparsity in RBMs using a dynamic programming algorithm.
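The sparsified positive phase can be sketched as follows; p_h holds the conditional probabilities P(h_i|v) for a mini-batch, and the shapes, the 30% rate and the Bernoulli sampling step are illustrative assumptions rather than the actual implementation.

```python
import numpy as np

def wta_rbm_positive_phase(p_h, rate, rng):
    """Sparsify P(h_i|v) across the mini-batch, then sample the hidden units.

    p_h:  (batch_size, n_hidden) matrix of P(h_i = 1 | v) for a mini-batch
    rate: lifetime sparsity rate, e.g. 0.3 keeps the 30% largest probabilities
    """
    batch_size = p_h.shape[0]
    k = max(1, int(np.ceil(rate * batch_size)))
    thresh = np.partition(p_h, batch_size - k, axis=0)[batch_size - k]
    p_sparse = np.where(p_h >= thresh, p_h, 0.0)
    # sample h_i ~ Bernoulli(sparsified probability)
    return (rng.random(p_sparse.shape) < p_sparse).astype(float)

rng = np.random.default_rng(0)
p_h = rng.random((100, 256))            # P(h_i|v) for a 100-sample mini-batch
h_sample = wta_rbm_positive_phase(p_h, rate=0.3, rng=rng)
```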

3.3 Convolutional Winner-Take-All Autoencoders

There are several problems with applying conventional sparse coding methods on large images. First, it is not practical to directly apply a fully-connected sparse coding algorithm on high-resolution (e.g., 256 × 256) images. Second, even if we could do that, we would learn a very redundant dictionary whose atoms are just shifted copies of each other. For example, in Figure 3.2a, the FC-WTA autoencoder has allocated different filters for the same patterns (i.e., mouths/noses/glasses/face borders) occurring at different locations. One way to address this problem is to extract random image patches from

Figure 3.3: Features learned on MNIST by RBMs with 256 hidden units: (a) standard RBM, (b) WTA-RBM (sparsity of 30%).

input images and then train an unsupervised learning algorithm on these patches in isolation [60]. Once training is complete, the filters can be used in a convolutional fashion to obtain representations of images. As discussed in [60, 71], the main problem with this approach is that if the receptive field is small, this method will not capture relevant features (imagine the extreme of 1 × 1 patches). Increasing the receptive field size is problematic, because then a very large number of features are needed to account for all the position-specific variations within the receptive field. For example, we see that in Figure 3.2b, the FC-WTA autoencoder allocates different filters to represent the same horizontal edge appearing at different locations within the receptive field. As a result, the learnt features are essentially shifted versions of each other, which results in redundancy between filters.

Unsupervised methods that make use of convolutional architectures can be used to address this problem, including convolutional RBMs [72], convolutional DBNs [73, 72], deconvolutional networks [74] and convolutional predictive sparse decomposition (PSD) [71, 7]. These methods learn features from the entire image in a convolutional fashion. In this setting, the filters can focus on learning the shapes (i.e., “what”), because the location information (i.e., “where”) is encoded into the feature maps, and thus the redundancy among the filters is reduced.

In this section, we propose Convolutional Winner-Take-All (CONV-WTA) autoencoders that learn to do shift-invariant/convolutional sparse coding by directly enforcing winner-take-all spatial and lifetime sparsity constraints. Our work is similar in spirit to deconvolutional networks [74] and convolutional PSD [71, 7]; however, whereas the approach in those works is to break apart the recognition pathway and the data generation pathway but learn them so that they are consistent, we describe a technique for directly learning a sparse convolutional autoencoder.

A shallow convolutional autoencoder maps an input vector to a set of feature maps in a convolutional fashion. We assume that the boundaries of the input image are zero-padded, so that each feature map has the same size as the input. The hidden representation is then mapped linearly to the output using a deconvolution operation (Appendix 3.7.1). The parameters are optimized to minimize the mean square error. A non-regularized convolutional autoencoder learns useless delta function filters that copy the input image to the feature maps and copy the feature maps back to the output. Interestingly, we have observed that even in the presence of denoising [16] or dropout [5] regularization, convolutional autoencoders still learn useless delta functions. Figure 3.4a depicts the filters of a convolutional autoencoder with 16 maps, 20% input dropout and 50% hidden unit dropout, trained on the Street View House Numbers dataset [75]. We see that the 16 learnt delta functions make 16 copies of the input pixels, so even if half of the hidden units get dropped during training, the network can still rely on the non-dropped copies to reconstruct the input. This highlights the need for new and more aggressive regularization techniques for convolutional autoencoders.

Figure 3.4: (a) Filters and feature maps of a denoising/dropout convolutional autoencoder, which learns useless delta functions. (b) Proposed architecture for the CONV-WTA autoencoder with spatial sparsity (128conv5-128conv5-128deconv11).

The proposed architecture for the CONV-WTA autoencoder is depicted in Figure 3.4b. The CONV-WTA autoencoder is a non-symmetric autoencoder where the encoder typically consists of a stack of several ReLU convolutional layers (e.g., 5 × 5 filters) and the decoder is a linear deconvolutional layer of larger size (e.g., 11 × 11 filters). We chose to use a deep encoder with smaller filters (e.g., 5 × 5) instead of a shallow one with larger filters (e.g., 11 × 11), because the former introduces more non-linearity and regularizes the network by forcing it to have a decomposition over large receptive fields through smaller filters. The CONV-WTA autoencoder is trained under two winner-take-all sparsity constraints: spatial sparsity and lifetime sparsity.

3.3.1 Spatial Sparsity

In the feedforward phase, after computing the last feature maps of the encoder, rather than reconstructing the input from all of the hidden units of the feature maps, we identify the single largest hidden activity within each feature map, and set the rest of the activities as well as their derivatives to zero. This results in a sparse representation whose sparsity level is the number of feature maps. The decoder then reconstructs the output using only the active hidden units in the feature maps, and the reconstruction error is only backpropagated through these hidden units as well. Consistent with other representation learning approaches such as triangle k-means [60] and deconvolutional networks [74, 76], we observed that using a softer sparsity constraint at test time results in a better classification performance.

Figure 3.5: The CONV-WTA autoencoder with 16 first-layer filters and 128 second-layer filters trained on MNIST: (a) input image; (b) learnt dictionary (deconvolution filters); (c) 16 feature maps while training (spatial sparsity applied); (d) 16 feature maps after training (spatial sparsity turned off); (e) 16 feature maps of the first layer after applying local max-pooling; (f) 48 out of 128 feature maps of the second layer after turning off the sparsity and applying local max-pooling (final representation).

So, in the CONV-WTA autoencoder, in order to find the final representation of the input image, we simply turn off the sparsity regularizer and use ReLU convolutions to compute the last-layer feature maps of the encoder. After that, we apply max-pooling (e.g., over 4 × 4 regions) on these feature maps and use this representation for classification tasks or for training stacked CONV-WTA autoencoders, as will be discussed in Section 3.3.3. Figure 3.5 shows a CONV-WTA autoencoder that was trained on MNIST.
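A minimal NumPy sketch of the training-time spatial-sparsity step is shown below; the (batch, maps, height, width) tensor layout, function name and random data are illustrative assumptions, not the actual implementation used in this chapter.

```python
import numpy as np

def spatial_sparsity(fmaps):
    """Keep only the single largest activation within each feature map.

    fmaps: (batch, n_maps, height, width) ReLU feature maps of the encoder.
    Returns a tensor of the same shape in which, for every (image, map) pair,
    all activations except the maximum one are set to zero; only these
    winning units would receive gradients during backpropagation.
    """
    b, m, h, w = fmaps.shape
    flat = fmaps.reshape(b, m, h * w)
    winners = flat.argmax(axis=2)[:, :, None]      # location of the max per map
    mask = np.zeros_like(flat, dtype=bool)
    np.put_along_axis(mask, winners, True, axis=2)
    return np.where(mask, flat, 0.0).reshape(b, m, h, w)

# Example: 100 images, 16 feature maps of size 28 x 28
rng = np.random.default_rng(0)
maps = np.maximum(rng.standard_normal((100, 16, 28, 28)), 0.0)
sparse_maps = spatial_sparsity(maps)
```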

3.3.2 Lifetime Sparsity

Although spatial sparsity is very effective in regularizing the autoencoder, it requires all the dictionary atoms to contribute to the reconstruction of every image. We can further increase the sparsity by exploiting winner-take-all lifetime sparsity as follows. Suppose we have 128 feature maps and the mini-batch size is 100. After applying spatial sparsity, for each filter we will have 100 “winner” hidden units corresponding to the 100 mini-batch images. During the feedforward phase, for each filter, we only keep the k% largest of these 100 values and set the rest of the activations to zero. Note that despite this aggressive sparsity, every filter is forced to get updated upon visiting every mini-batch, which is crucial for avoiding the dead filter problem that often occurs in sparse coding.

Figure 3.6 and Figure 3.7 show the effect of the lifetime sparsity on the dictionaries trained on MNIST and the Toronto Face dataset. We see that, similar to the FC-WTA autoencoders, by tuning the lifetime sparsity of CONV-WTA autoencoders we can aim for different sparsity rates. If no lifetime sparsity is enforced, we learn local filters that contribute to every training point (Figure 3.6a and Figure 3.7a). As we increase the lifetime sparsity, we can learn rare but useful features that result in better classification (Figure 3.6b). Nevertheless, forcing too much lifetime sparsity will result in features that are too diverse and rare and do not properly factor the input into parts (Figure 3.6c and Figure 3.7b).

Figure 3.6: Learnt dictionary (deconvolution filters) of the CONV-WTA autoencoder trained on MNIST (64conv5-64conv5-64conv5-64deconv11): (a) spatial sparsity only, (b) spatial and lifetime sparsity of 20%, (c) spatial and lifetime sparsity of 5%.

Figure 3.7: Learnt dictionary (deconvolution filters) of the CONV-WTA autoencoder trained on the Toronto Face dataset (64conv7-64conv7-64conv7-64deconv15): (a) spatial sparsity only, (b) spatial and lifetime sparsity of 10%.
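The combination of the spatial sparsity of Section 3.3.1 with the lifetime sparsity described above can be sketched as follows. This is a sketch under assumed tensor layout and function names, not the actual implementation used in this chapter.

```python
import numpy as np

def conv_wta_sparsity(fmaps, lifetime_rate):
    """Spatial sparsity followed by lifetime sparsity for a CONV-WTA encoder.

    fmaps: (batch, n_maps, height, width) ReLU feature maps.
    lifetime_rate: fraction of per-image winners kept for each filter,
                   e.g. 0.2 keeps the 20% largest winners across the batch.
    """
    b, m, h, w = fmaps.shape
    flat = fmaps.reshape(b, m, h * w)

    # spatial sparsity: one winner per (image, feature map)
    winners = flat.argmax(axis=2)[:, :, None]
    mask = np.zeros_like(flat, dtype=bool)
    np.put_along_axis(mask, winners, True, axis=2)
    spatial = np.where(mask, flat, 0.0)

    # lifetime sparsity: for each filter, keep the k% largest winners
    winner_vals = spatial.max(axis=2)                           # (batch, n_maps)
    k = max(1, int(np.ceil(lifetime_rate * b)))
    thresh = np.partition(winner_vals, b - k, axis=0)[b - k]    # per-filter threshold
    keep = (winner_vals >= thresh)[:, :, None]                  # broadcast over space
    return np.where(keep, spatial, 0.0).reshape(b, m, h, w)

# Example: 100 images, 128 feature maps, 20% lifetime sparsity
rng = np.random.default_rng(0)
maps = np.maximum(rng.standard_normal((100, 128, 28, 28)), 0.0)
out = conv_wta_sparsity(maps, lifetime_rate=0.2)
```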

3.3.3 Stacked CONV-WTA Autoencoders

The CONV-WTA autoencoder can be used as a building block to form a hierarchy. In order to train the hierarchical model, we first train a CONV-WTA autoencoder on the input images. Then we pass all the training examples through the network and obtain their representations (last layer of the encoder after turning off sparsity and applying local max-pooling). Now we treat these representations as a new dataset and train another CONV-WTA autoencoder to obtain the stacked representations. Figure 3.5f shows the deep feature maps of a stacked CONV-WTA autoencoder that was trained on MNIST.

3.3.4 Scaling CONV-WTA Autoencoders to Large Images

The goal of convolutional sparse coding is to learn shift-invariant dictionary atoms and encoding filters. Once the filters are learnt, they can be applied convolutionally to any image of any size, producing a spatial map corresponding to different locations of the input.

Figure 3.8: Learnt dictionary (deconvolution filters) of the CONV-WTA autoencoder trained on ImageNet 48 × 48 whitened patches (64conv5-64conv5-64conv5-64deconv11): (a) spatial sparsity only, (b) spatial and lifetime sparsity of 10%.

We can use this idea to efficiently train CONV-WTA autoencoders on datasets containing large images. Suppose we want to train an AlexNet [1] architecture in an unsupervised fashion on ImageNet, ILSVRC-2012 (224 × 224). In order to learn the first-layer 11 × 11 shift-invariant filters, we can extract medium-size image patches of size 48 × 48 and train a CONV-WTA autoencoder with 64 dictionary atoms of size 11 on these patches. This will result in 64 shift-invariant filters of size 11 × 11 that can efficiently capture the statistics of 48 × 48 patches. Once the filters are learnt, we can apply them in a convolutional fashion with a stride of 4 to the entire image, and after max-pooling we will have a 64 × 27 × 27 representation of the image. Now we can train another CONV-WTA autoencoder on top of these feature maps to capture the statistics of a larger receptive field at different locations of the input image. This process could be repeated for multiple layers. Figure 3.8 shows the dictionary learnt on ImageNet using this approach. We can see that by imposing lifetime sparsity, we could learn very diverse filters such as corner, circular and blob detectors.
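As a rough illustration of the arithmetic involved, the sketch below computes the spatial sizes when the learnt 11 × 11 filters are applied to a full 224 × 224 image with a stride of 4 and then max-pooled; the padding of 2 and the 3 × 3 pooling with stride 2 are assumptions in the spirit of [1], not values specified in this chapter.

```python
def conv_out(size, kernel, stride, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

# 11x11 filters learnt on 48x48 patches, applied convolutionally to 224x224 images
after_conv = conv_out(224, kernel=11, stride=4, pad=2)   # assumed padding of 2
after_pool = conv_out(after_conv, kernel=3, stride=2)    # assumed 3x3 pooling, stride 2
print(after_conv, after_pool)   # 55, 27  ->  a 64 x 27 x 27 representation
```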

3.4 Experiments

In all the experiments of this section, we evaluate the quality of the unsupervised features of WTA autoencoders by training a naive linear classifier (i.e., an SVM) on top of them. We did not fine-tune the filters in any of the experiments. The details of all the experiments are provided in Appendix 3.7.2.

3.4.1 Winner-Take-All Autoencoders on MNIST

The MNIST dataset has 60K training points and 10K test points. Table 3.1 compares the performance of FC-WTA autoencoders and WTA-RBMs with other permutation-invariant architectures. Table 3.2 compares the performance of the CONV-WTA autoencoder with other convolutional architectures. In these experiments, we have used all the available training labels (N = 60000 points) to train a linear SVM on top of the unsupervised features.

                                                                        Error Rate
Shallow Denoising/Dropout Autoencoder (20% input, 50% hidden dropout)   1.60%
Stacked Denoising Autoencoder (3 layers) [16]                           1.28%
Deep Boltzmann Machines [8]                                             0.95%
k-Sparse Autoencoder [57]                                               1.35%
Shallow FC-WTA Autoencoder, 2000 units, 5% sparsity                     1.20%
Stacked FC-WTA Autoencoder, 5% and 2% sparsity                          1.11%
Restricted Boltzmann Machines                                           1.60%
Winner-Take-All Restricted Boltzmann Machines (30% sparsity)            1.38%

Table 3.1: Classification performance of FC-WTA autoencoder features + SVM on MNIST.

An advantage of unsupervised learning algorithms is the ability to use them in semi-supervised scenarios where labeled data is limited. Table 3.3 shows the semi-supervised performance of the CONV-WTA autoencoder where we have assumed only N labels are available. In this case, the unsupervised features are still trained on the whole dataset (60K points), but the SVM is trained only on the N labeled points.

                                                   Error
Deep Deconvolutional Network [74, 76]              0.84%
Convolutional Deep Belief Network [72]             0.82%
Scattering Convolution Network [77]                0.43%
Convolutional Kernel Network [78]                  0.39%
CONV-WTA Autoencoder, 16 maps                      1.02%
CONV-WTA Autoencoder, 128 maps                     0.64%
Stacked CONV-WTA Autoencoder, 128 & 2048 maps      0.48%

Table 3.2: Classification performance of the CONV-WTA autoencoder trained on MNIST. Unsupervised features + SVM trained on N = 60000 labels (no fine-tuning).

                                       N=300   N=600   N=1K    N=2K    N=5K    N=10K   N=60K
Supervised Conv. Network [79]          7.18%   5.28%   3.21%   2.53%   1.52%   0.85%   0.53%
Convolutional Kernel Network [78]      4.15%   -       2.05%   1.51%   1.21%   0.88%   0.39%
Scattering Convolution Network [77]    4.70%   -       2.30%   1.30%   1.03%   0.88%   0.43%
CONV-WTA Autoencoder                   3.47%   2.37%   1.92%   1.45%   1.07%   0.91%   0.48%

Table 3.3: Classification performance of the CONV-WTA autoencoder trained on MNIST. Unsupervised features + SVM trained on few labels N (semi-supervised).

                                                                   Accuracy
Convolutional Triangle k-means [75]                                90.6%
CONV-WTA Autoencoder, 256 maps (N=600K)                            88.5%
Stacked CONV-WTA Autoencoder, 256 and 1024 maps (N=600K)           93.1%
Deep Variational Autoencoders (non-convolutional) [80] (N=1000)    63.9%
Stacked CONV-WTA Autoencoder, 256 and 1024 maps (N=1000)           76.2%
Supervised Maxout Network [6] (N=600K)                             97.5%

Table 3.4: Unsupervised features of the CONV-WTA autoencoder + SVM trained on N labeled points of the SVHN dataset.

The number of labeled points N varies from 300 to 60K. We compare this with the performance of a supervised deep convnet (CNN) [79] trained only on the N labeled training points. We can see that supervised deep learning techniques fail to learn good representations when labeled data is limited, whereas our WTA algorithm can extract useful features from the unlabeled data and achieve better classification. We also compare our method with some of the best semi-supervised learning results recently obtained by convolutional kernel networks (CKN) [78] and convolutional scattering networks (SC) [77]. We see that CONV-WTA outperforms both of these methods when very few labels are available (N < 1K).

3.4.2 CONV-WTA Autoencoder on Street View House Numbers

The SVHN dataset has about 600K training points and 26K test points. We first applied global contrast normalization to the images and then used local contrast normalization with a Gaussian kernel to preprocess each channel of the images. This is the same preprocessing that is used in [6]. The contrast-normalized SVHN images are shown in Figure 3.9a. Table 3.4 reports the classification results of the CONV-WTA autoencoder on this dataset. We first trained a shallow and a stacked CONV-WTA autoencoder on all 600K training cases to learn the unsupervised features, and then performed two sets of experiments.

Figure 3.9: The CONV-WTA autoencoder trained on the Street View House Numbers (SVHN) dataset: (a) contrast-normalized SVHN images, (b) learnt dictionary (64conv5-64conv5-64conv5-64deconv11).

In the first experiment, we used all of the N = 600K available labels to train an SVM on top of the CONV-WTA autoencoder features, and compared the result with convolutional k-means [75]. We see that the stacked CONV-WTA autoencoder achieves a dramatic improvement over the shallow CONV-WTA autoencoder as well as k-means. In the second experiment, we trained an SVM using only N = 1000 labeled data points and compared the result with deep variational autoencoders [80] trained in the same semi-supervised fashion. Figure 3.9b shows the learnt dictionary of the CONV-WTA autoencoder on this dataset.

3.4.3 CONV-WTA Autoencoder on CIFAR-10

Table 3.5 reports the classification results of CONV-WTA autoencoders on the CIFAR-10 dataset. We see that when a small number of feature maps (< 256) is used, considerable improvements over k-means can be achieved. This is because our method can learn a shift-invariant dictionary, as opposed to the redundant dictionaries learnt by patch-based methods such as k-means. In the largest deep network that we trained, we used 256, 1024 and 4096 maps and achieved a classification rate of 80.1% without using fine-tuning, model averaging or data augmentation. Figure 3.10 shows the learnt dictionary on the CIFAR-10 dataset. We can see that the network has learnt diverse shift-invariant filters such as point/corner detectors, as opposed to Figure 3.2b, which shows the position-specific filters of patch-based methods.

3.5 Discussion

Relationship of FC-WTA autoencoders to k-sparse autoencoders. k-sparse autoencoders impose sparsity across different channels (population sparsity), whereas FC-WTA autoencoders impose sparsity across training examples (lifetime sparsity).

Figure 3.10: CONV-WTA autoencoder trained on the CIFAR-10 dataset.

                                                           Accuracy
Shallow Convolutional Triangle k-means (64 maps) [60]      62.3%
Shallow CONV-WTA Autoencoder (64 maps)                     68.9%
Shallow Convolutional Triangle k-means (256 maps) [60]     70.2%
Shallow CONV-WTA Autoencoder (256 maps)                    72.3%
Shallow Convolutional Triangle k-means (4000 maps) [60]    79.6%
Deep Triangle k-means (1600, 3200, 3200 maps) [81]         82.0%
Convolutional Deep Belief Net (2 layers) [73]              78.9%
Exemplar CNN (300x Data Augmentation) [82]                 82.0%
NOMP (3200, 6400, 6400 maps + Averaging 7 Models) [83]     82.9%
Stacked CONV-WTA (256, 1024 maps)                          77.9%
Stacked CONV-WTA (256, 1024, 4096 maps)                    80.1%
Supervised Maxout Network [6]                              88.3%

Table 3.5: Unsupervised features + SVM (without fine-tuning)

When aiming for low sparsity levels, k-sparse autoencoders use a scheduling technique to avoid the dead dictionary atom problem. WTA autoencoders, however, do not have this problem, since all the hidden units get updated upon visiting every mini-batch, no matter how aggressive the sparsity rate is (no scheduling required). As a result, we can train larger networks and achieve better classification rates.

Relationship of CONV-WTA autoencoders to deconvolutional networks and convolutional PSD. Deconvolutional networks [74, 76] are top-down models with no direct link from the image to the feature maps. Inferring the sparse maps requires running the iterative ISTA algorithm, which is costly. Convolutional PSD [71] addresses this problem by training a parameterized encoder separately to explicitly predict the sparse codes using a soft thresholding operator. Deconvolutional networks and convolutional PSD can be viewed as the generative decoder and encoder paths of a convolutional autoencoder. Our contribution is to propose a specific winner-take-all approach for training a convolutional autoencoder, in which both paths are trained jointly using direct backpropagation, yielding an algorithm that is much faster, easier to implement and able to train much larger networks.

Relationship to maxout networks. Maxout networks [6] take the max across different channels, whereas our method takes the max across space and mini-batch dimensions. Also, the winner-take-all feature maps retain the location information of the “winners” within each feature map, and different locations have different connectivity to the subsequent layers, whereas the maxout activity is passed to the next layer using weights that are the

same regardless of which unit gave the maximum.

3.6 Conclusion

We proposed the winner-take-all spatial and lifetime sparsity methods to train autoencoders that learn to do fully-connected and convolutional sparse coding. We observed that CONV-WTA autoencoders learn shift-invariant and diverse dictionary atoms as opposed to position-specific Gabor-like atoms that are typically learnt by conventional sparse coding methods. Unlike related approaches, such as deconvolutional networks and convolutional PSD, our method jointly trains the encoder and decoder paths by direct back-propagation, and does not require an iterative EM-like optimization technique during training. We described how our method can be scaled to large datasets such as ImageNet and showed the necessity of the deep architecture to achieve better results. We performed experiments on the MNIST, SVHN and CIFAR-10 datasets and showed that the classification rates of winner-take-all autoencoders are competitive with the state-of-the-art.

3.7 Appendix

3.7.1 Implementation Details

In this section, we describe the network architectures and hyper-parameters that were used in the experiments. While most conventional sparse coding algorithms require complex matrix operations such as matrix inversion or singular value decomposition (SVD), WTA autoencoders only require the sort operation in addition to matrix multiplication and convolution, which are all efficiently implemented in most GPU libraries. We used Alex Krizhevsky’s cuda-convnet convolution kernels [1] for this work.

Deconvolution Kernels

At the decoder of a convolutional autoencoder, deconvolutional layers are used. The deconvolution operation is exactly the reverse of convolution (i.e., its forward pass is the backward pass of convolution). For example, whereas a strided convolution decreases the feature map size, a strided deconvolution increases the map size. We implemented the deconvolution kernels by minor modifications of current available GPU kernels for the convolution operation.
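As a rough 1-D illustration of this relationship (a sketch only, not the modified GPU kernels used in this work), the forward pass of the strided deconvolution below is exactly the backward pass of a strided convolution, and therefore increases the signal length:

```python
import numpy as np

def deconv1d(x, kernel, stride=1):
    """Naive 1-D transposed convolution ("deconvolution").

    Each input value scatters a scaled copy of the kernel into the output,
    which is the gradient (backward) pass of a strided 1-D convolution.
    A strided deconvolution therefore increases the signal length.
    """
    out_len = (len(x) - 1) * stride + len(kernel)
    out = np.zeros(out_len)
    for i, v in enumerate(x):
        out[i * stride : i * stride + len(kernel)] += v * kernel
    return out

# a length-4 map with stride 2 and a length-5 kernel maps to a length-11 output
print(deconv1d(np.ones(4), np.ones(5), stride=2).shape)   # (11,)
```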

Effect of Tied Weights

We found that tying the encoder and decoder weights of FC-WTA autoencoders helps their generalization performance. However, tying the convolution and deconvolution weights of CONV-WTA autoencoders hurts the generalization performance (data not shown). We think this is because the CONV-WTA autoencoder is already heavily regularized by the aggressive sparsity constraints, and tying the weights results in too much regularization.

3.7.2 Experiment Details

CONV-WTA Autoencoder on MNIST

On the MNIST dataset, we trained two networks.
Shallow CONV-WTA Autoencoder (128 maps). In the shallow architecture, we used 128 filters with a 7 × 7 receptive field applied at strides of 1 pixel. After training, we used max-pooling over 5 × 5 regions at strides of 3 pixels to obtain the final 128 × 10 × 10 representation. SVM was then applied to this representation for classification.
Stacked CONV-WTA Autoencoder (128, 2048 maps). In the deep architecture, we trained another 2048 feature maps on top of the pooled feature maps of the first network, with a filter width of 3 applied at strides of 1 pixel. After training, we used max-pooling over 3 × 3 regions at strides of 2 pixels to obtain the final 2048 × 5 × 5 representation. SVM was then applied to this representation for classification.

Semi-Supervised CONV-WTA Autoencoder. In the semi-supervised setup, the amount of labeled data was varied from N = 300 to N = 60000. We ensured the dataset is balanced and each class has the same number of labeled points in all the experiments. We used the stacked CONV-WTA autoencoder (128, 2048 maps) trained in the previous part, and trained an SVM on top of the unsupervised features using only the N labeled data points.

CONV-WTA Autoencoder on SVHN

The Street View House Numbers (SVHN) dataset consists of about 600,000 images (both the difficult and the simple sets) and 26,000 test images. We first applied global contrast normalization to the images and then used local contrast normalization with a Gaussian kernel to preprocess each channel of the images. This is the same preprocessing that is used in [6]. The contrast-normalized SVHN images are shown in Figure 3.9a. We trained two networks on this dataset.
CONV-WTA Autoencoder (256 maps). The architecture used for this network is 256conv3-256conv3-256conv3-256deconv7. After training, we used max-pooling on the last 256 feature maps of the encoder, over 6 × 6 regions at strides of 4 pixels, to obtain the final 256 × 8 × 8 representation. SVM was then applied to this representation for classification. We observed that having a stack of conv3 layers instead of a single 256conv7 encoder significantly improved the classification rate.
Stacked CONV-WTA Autoencoder (256, 1024 maps). In the stacked architecture, we trained another 1024 feature maps on top of the pooled feature maps of the first network, with a filter width of 3 applied at strides of 1 pixel. After training, we used max-pooling over 3 × 3 regions at strides of 2 pixels to obtain the final 1024 × 4 × 4 representation. SVM was then applied to this representation for classification.
Semi-Supervised CONV-WTA Autoencoder. In the semi-supervised setup, we assumed only N = 1000 labeled data points are available. We used the stacked CONV-WTA autoencoder (256, 1024 maps) trained in the previous part, and trained an SVM on top of the unsupervised features using only the N = 1000 labeled data points.

CONV-WTA Autoencoder on CIFAR-10

On the CIFAR-10 dataset, we used global contrast normalization followed by ZCA whitening with the regularization bias of 0.1 to preprocess the dataset. This is the same

preprocessing that is used in [60]. We trained three networks on CIFAR-10.
CONV-WTA Autoencoder (256 maps). The architecture used for this network is 256conv3-256conv3-256conv3-256deconv7. After training, we used max-pooling on the last 256 feature maps of the encoder, over 6 × 6 regions at strides of 4 pixels, to obtain the final 256 × 8 × 8 representation. SVM was then applied to this representation for classification.
Stacked CONV-WTA Autoencoder (256, 1024 maps). For this network, we trained another 1024 feature maps on top of the pooled feature maps of the first network, with a filter width of 3 applied at strides of 1 pixel. After training, we used max-pooling over 3 × 3 regions at strides of 2 pixels to obtain the final 1024 × 4 × 4 representation. SVM was then applied to this representation for classification.
Stacked CONV-WTA Autoencoder (256, 1024, 4096 maps). For this model, we first trained a CONV-WTA network with the architecture of 256conv3-256conv3-256conv3-256deconv7. After training, we used max-pooling on the last 256 feature maps of the encoder, over 3 × 3 regions at strides of 2 pixels, to obtain a 256 × 16 × 16 representation. We then trained another 1024 feature maps with a filter width of 3 and a stride of 1 on top of the pooled feature maps of the first layer. We then obtained the second-layer representation by max-pooling the 1024 feature maps with a pooling stride of 2 and width of 3 to obtain a 1024 × 8 × 8 representation. We then trained another 4096 feature maps with a filter width of 3 and a stride of 1 on top of the pooled feature maps of the second layer. Then we used max-pooling on the 4096 feature maps with a pooling width of 3 applied at strides of 2 pixels to obtain the final 4096 × 4 × 4 representation. An SVM was trained on top of the final representation for classification.

Chapter 4

Adversarial Autoencoders

4.1 Introduction

Building scalable generative models to capture rich distributions such as audio, images or video is one of the central challenges of machine learning. In this chapter, we study autoencoders from a probabilistic perspective and propose new regularization techniques that can turn autoencoders into generative models. In particular, we propose the adversarial autoencoder (AAE), which is a probabilistic autoencoder that uses the recently proposed generative adversarial networks (GAN) [11] to perform variational inference. In our model, an autoencoder is trained with dual objectives – a traditional reconstruction error criterion, and an adversarial training criterion [11] that matches the aggregated posterior distribution of the latent representation of the autoencoder to an arbitrary prior distribution. Matching the aggregated posterior to the prior ensures that generating from any part of prior space results in meaningful samples. We show that the training criterion of AAE has a strong connection to VAE training. The result of the training is that the encoder learns to convert the data distribution to the prior distribution, while the decoder learns a deep generative model that maps the imposed prior to the data distribution. We show how the adversarial autoencoder can be used in applications such as semi-supervised classification, disentangling style and content of images, unsupervised clustering, dimensionality reduction and data visualization. We perform experiments on MNIST, Street View House Numbers and Toronto Face datasets and show that adversarial autoencoders achieve competitive results in generative modeling and semi-supervised classification tasks.


Figure 4.1: Architecture of an adversarial autoencoder. The top row is a standard autoencoder that reconstructs an image x from a latent code z. The bottom row diagrams a second network trained to discriminatively predict whether a sample arises from the hidden code of the autoencoder or from a sampled distribution specified by the user.

4.2 Adversarial Autoencoders

Let x be the input and z be the latent code vector (hidden units) of an autoencoder with a deep encoder and decoder. Let p(z) be the prior distribution we want to impose on the codes, q(z|x) be an encoding distribution and p(x|z) be the decoding distribution.

Also let p_d(x) be the data distribution, and p(x) be the model distribution. The encoding function of the autoencoder, q(z|x), defines an aggregated posterior distribution q(z) on the hidden code vector of the autoencoder as follows:

q(z) = ∫_x q(z|x) p_d(x) dx    (4.1)
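In practice, Equation 4.1 is never evaluated in closed form; samples from q(z) are obtained by ancestral sampling, i.e., by encoding examples drawn from the training set. The sketch below illustrates this under the deterministic encoder described later in this section; the function name, the toy linear encoder and the array shapes are illustrative assumptions.

```python
import numpy as np

def sample_aggregated_posterior(encoder, data, batch_size, rng):
    """Draw samples from q(z) = ∫ q(z|x) p_d(x) dx by ancestral sampling:
    pick x ~ p_d(x) from the training set, then compute z = encoder(x)."""
    idx = rng.integers(0, data.shape[0], size=batch_size)
    return encoder(data[idx])          # deterministic q(z|x)

# Example with a random linear encoder on toy data
rng = np.random.default_rng(0)
data = rng.random((1000, 784))
W = rng.standard_normal((784, 8)) * 0.01
z_batch = sample_aggregated_posterior(lambda x: x @ W, data, batch_size=100, rng=rng)
```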

The adversarial autoencoder is an autoencoder that is regularized by matching the aggregated posterior, q(z), to an arbitrary prior, p(z). In order to do so, an adversarial network is attached on top of the hidden code vector of the autoencoder as illustrated in Figure 4.1. It is the adversarial network that guides q(z) to match p(z). The autoencoder, meanwhile, attempts to minimize the reconstruction error. The generator of the adversarial network is also the encoder of the autoencoder, q(z|x). The encoder ensures the aggregated posterior distribution can fool the discriminative adversarial network into thinking that the hidden code q(z) comes from the true prior distribution p(z). Both the adversarial network and the autoencoder are trained jointly with SGD in two phases – the reconstruction phase and the regularization phase – executed on each mini-batch. In the reconstruction phase, the autoencoder updates the encoder and the decoder to minimize the reconstruction error of the inputs.

In the regularization phase, the adversarial network first updates its discriminative network to tell apart the true samples (generated using the prior) from the generated samples (the hidden codes computed by the autoencoder). The adversarial network then updates its generator (which is also the encoder of the autoencoder) to confuse the discriminative network. Once the training procedure is done, the decoder of the autoencoder will define a generative model that maps the imposed prior p(z) to the data distribution.

There are several possible choices for the encoder, q(z|x), of adversarial autoencoders:

Deterministic: Here we assume that q(z|x) is a deterministic function of x. In this case, the encoder is similar to the encoder of a standard autoencoder and the only source of stochasticity in q(z) is the data distribution, p_d(x).

Gaussian posterior: Here we assume that q(z|x) is a Gaussian distribution whose mean and variance are predicted by the encoder network: z_i ∼ N(µ_i(x), σ_i(x)). In this case, the stochasticity in q(z) comes from both the data distribution and the randomness of the Gaussian distribution at the output of the encoder. We can use the same re-parametrization trick of [10] for back-propagation through the encoder network.

Universal approximator posterior: Adversarial autoencoders can be used to train q(z|x) as a universal approximator of the posterior. Suppose the encoder network of the adversarial autoencoder is a function f(x, η) that takes the input x and a random noise η with a fixed distribution (e.g., Gaussian). We can sample from an arbitrary posterior distribution q(z|x) by evaluating f(x, η) at different samples of η. In other words, we can assume q(z|x, η) = δ(z − f(x, η)), and the posterior q(z|x) and the aggregated posterior q(z) are defined as follows:

q(z|x) = ∫_η q(z|x, η) p_η(η) dη   ⇒   q(z) = ∫_x ∫_η q(z|x, η) p_d(x) p_η(η) dη dx    (4.2)

In this case, the stochasticity in q(z) comes from both the data distribution and the random noise η at the input of the encoder. Note that in this case the posterior q(z|x) is no longer constrained to be Gaussian and the encoder can learn any arbitrary posterior distribution for a given input x. Since there is an efficient method of sampling from the aggregated posterior q(z), the adversarial training procedure can match q(z) to p(z) by direct back-propagation through the encoder network f(x, η). Choosing different types of q(z|x) will result in different kinds of models with different training dynamics.

For example, in the deterministic case of q(z|x), the network has to match q(z) to p(z) by exploiting only the stochasticity of the data distribution; but since the empirical distribution of the data is fixed by the training set, and the mapping is deterministic, this might produce a q(z) that is not very smooth. However, in the Gaussian or universal approximator case, the network has access to additional sources of stochasticity that can help it in the adversarial regularization stage by smoothing out q(z). Nevertheless, after an extensive hyper-parameter search, we obtained similar test-likelihoods with each type of q(z|x). So in the rest of the chapter, we only report results with the deterministic version of q(z|x).
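The losses optimized in the two training phases described above can be sketched as follows. This is a minimal sketch with toy linear networks, a standard Gaussian prior and a sigmoid discriminator; the network shapes, the small-constant stabilizer and the omission of the actual gradient updates are assumptions made for illustration, not the architectures used in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W_e):      return x @ W_e                              # deterministic q(z|x)
def decoder(z, W_d):      return z @ W_d                              # p(x|z)
def discriminator(z, W_dis):                                          # D(z) in (0, 1)
    return 1.0 / (1.0 + np.exp(-(z @ W_dis)))

# toy shapes: 784-D inputs, 8-D hidden code
x     = rng.random((100, 784))
W_e   = rng.standard_normal((784, 8)) * 0.01
W_d   = rng.standard_normal((8, 784)) * 0.01
W_dis = rng.standard_normal((8, 1))   * 0.01

# --- reconstruction phase: the encoder and decoder are updated on this loss ---
z_fake   = encoder(x, W_e)
recon    = decoder(z_fake, W_d)
loss_rec = np.mean((recon - x) ** 2)

# --- regularization phase ---
z_real = rng.standard_normal(z_fake.shape)          # samples from the prior p(z)
d_real = discriminator(z_real, W_dis)
d_fake = discriminator(z_fake, W_dis)
# (i) discriminator loss: tell prior samples apart from hidden codes
loss_disc = -np.mean(np.log(d_real + 1e-8) + np.log(1.0 - d_fake + 1e-8))
# (ii) generator (= encoder) loss: fool the discriminator
loss_gen = -np.mean(np.log(d_fake + 1e-8))
# the SGD updates of W_e, W_d and W_dis on these losses are omitted here
```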

4.2.1 Relationship to Variational Autoencoders

Our work is similar in spirit to variational autoencoders [10]; however, while they use a KL divergence penalty to impose a prior distribution on the hidden code vector of the autoencoder, we use an adversarial training procedure to do so by matching the aggregated posterior of the hidden code vector with the prior distribution. VAE [10] minimizes the following upper-bound on the negative log-likelihood of x:

E_{x∼p_d(x)}[− log p(x)] < E_x[E_{q(z|x)}[− log p(x|z)]] + E_x[KL(q(z|x) ‖ p(z))]
    = E_x[E_{q(z|x)}[− log p(x|z)]] − E_x[H(q(z|x))] + E_{q(z)}[− log p(z)]
    = E_x[E_{q(z|x)}[− log p(x|z)]] − E_x[Σ_i log σ_i(x)] + E_{q(z)}[− log p(z)] + const.
    = Reconstruction − Entropy + CrossEntropy(q(z), p(z))    (4.3)

where the aggregated posterior q(z) is defined in Equation 4.1 and we have assumed q(z|x) is Gaussian and p(z) is an arbitrary distribution. The variational bound contains three terms. The first term can be viewed as the reconstruction term of an autoencoder, and the second and third terms can be viewed as regularization terms. Without the regularization terms, the model is simply a standard autoencoder that reconstructs the input. However, in the presence of the regularization terms, the VAE learns a latent representation that is compatible with p(z). The second term of the cost function encourages large variances for the posterior distribution, while the third term minimizes the cross-entropy between the aggregated posterior q(z) and the prior p(z). The KL divergence, or the cross-entropy term in Equation 4.3, encourages q(z) to pick the modes of p(z). In adversarial autoencoders, we replace the second two terms with an adversarial training procedure that encourages q(z) to match the whole distribution of p(z).

In this section, we compare the ability of the adversarial autoencoder and the VAE to impose a specified prior distribution p(z) on the coding distribution.


Figure 4.2: Comparison of the adversarial and variational autoencoder on MNIST. The hidden code z of the hold-out images for an adversarial autoencoder fit to (a) a 2-D Gaussian and (b) a mixture of 10 2-D Gaussians. Each color represents the associated label. Same for the variational autoencoder with (c) a 2-D Gaussian and (d) a mixture of 10 2-D Gaussians. (e) Images generated by uniformly sampling the Gaussian percentiles along each hidden code dimension z in the 2-D Gaussian adversarial autoencoder.

Figure 4.2a shows the coding space z of the test data resulting from an adversarial autoencoder trained on MNIST digits in which a spherical 2-D Gaussian prior distribution is imposed on the hidden codes z. The learned manifold in Figure 4.2a exhibits sharp transitions, indicating that the coding space is filled and exhibits no “holes”. In practice, sharp transitions in the coding space indicate that images generated by interpolating within z lie on the data manifold (Figure 4.2e). By contrast, Figure 4.2c shows the coding space of a VAE with the same architecture used in the adversarial autoencoder experiments. We can see that in this case the VAE roughly matches the shape of a 2-D Gaussian distribution. However, no data points map to several local regions of the coding space, indicating that the VAE may not have captured the data manifold as well as the adversarial autoencoder. Figure 4.2b and Figure 4.2d show the code space of an adversarial autoencoder and of a VAE where the imposed distribution is a mixture of 10 2-D Gaussians. The adversarial autoencoder successfully matched the aggregated posterior with the prior distribution (Figure 4.2b).

In contrast, the VAE exhibits systematic differences from the mixture of 10 Gaussians, indicating that the VAE emphasizes matching the modes of the distribution, as discussed above (Figure 4.2d).

An important difference between VAEs and adversarial autoencoders is that in VAEs, in order to back-propagate through the KL divergence by Monte-Carlo sampling, we need to have access to the exact functional form of the prior distribution. However, in AAEs, we only need to be able to sample from the prior distribution in order to induce q(z) to match p(z). In Section 4.2.3, we will demonstrate that the adversarial autoencoder can impose complicated distributions (e.g., swiss roll distribution) without having access to the explicit functional form of the distribution.

4.2.2 Relationship to GANs and GMMNs

In the original generative adversarial networks (GAN) paper [11], GANs were used to impose the data distribution at the pixel level on the output layer of a neural network. Adversarial autoencoders, however, rely on the autoencoder training to capture the data distribution. In the adversarial training procedure of our method, a much simpler distribution (e.g., a Gaussian, as opposed to the data distribution) is imposed in a much lower-dimensional space (e.g., 20 dimensions as opposed to 1000), which results in a better test-likelihood, as discussed in Section 4.3.

Generative moment matching networks (GMMN) [50] use the maximum mean discrepancy (MMD) objective to shape the distribution of the output layer of a neural network. The MMD objective can be interpreted as minimizing the distance between all moments of the model distribution and the data distribution. It has been shown that GMMNs can be combined with pre-trained dropout autoencoders to achieve better likelihood results (GMMN+AE). Our adversarial autoencoder also relies on the autoencoder to capture the data distribution. However, the main difference between our work and GMMN+AE is that the adversarial training procedure of our method acts as a regularizer that shapes the code distribution while training the autoencoder from scratch, whereas the GMMN+AE model first trains a standard dropout autoencoder and then fits a distribution in the code space of the pre-trained network. In Section 4.3, we will show that the test-likelihood achieved by the joint training scheme of adversarial autoencoders outperforms that of GMMN and GMMN+AE on the MNIST and Toronto Face datasets.

Figure 4.3: Regularizing the hidden code by providing a one-hot vector to the discriminative network. The one-hot vector has an extra label for training points with unknown classes.

4.2.3 Incorporating Label Information in the Adversarial Regularization

In the scenarios where data is labeled, we can incorporate the label information in the adversarial training stage to better shape the distribution of the hidden code. In this section, we describe how to leverage partial or complete label information to regularize the latent representation of the autoencoder more heavily. To demonstrate this architecture, we return to Figure 4.2b, in which the adversarial autoencoder is fit to a mixture of 10 2-D Gaussians. We now aim to force each mode of the mixture of Gaussians to represent a single label of MNIST. Figure 4.3 demonstrates the training procedure for this semi-supervised approach. We add a one-hot vector to the input of the discriminative network to associate the label with a mode of the distribution. The one-hot vector acts as a switch that selects the corresponding decision boundary of the discriminative network given the class label. This one-hot vector has an extra class for unlabeled examples. For example, in the case of imposing a mixture of 10 2-D Gaussians (Figure 4.2b and Figure 4.4a), the one-hot vector contains 11 classes. Each of the first 10 classes selects a decision boundary for the corresponding individual mixture component. The extra class in the one-hot vector corresponds to unlabeled training points. When an unlabeled point is presented to the model, the extra class is turned on to select the decision boundary for the full mixture of Gaussians. During the positive phase of adversarial training, we provide the label of the mixture component (that the positive sample is drawn from) to the discriminator through the one-hot vector. The positive samples fed for unlabeled examples come from the full mixture of Gaussians, rather than from a particular class. During the negative phase, we provide the label of the training point image to the discriminator through the one-hot vector.


Figure 4.4: Leveraging label information to better regularize the hidden code. Top Row: Training the coding space to match a mixture of 10 2-D Gaussians: (a) Coding space z of the hold-out images. (b) The manifold of the first 3 mixture components: each panel includes images generated by uniformly sampling the Gaussian percentiles along the axes of the corresponding mixture component. Bottom Row: Same but for a swiss roll distribution (see text). Note that labels are mapped in a numeric order (i.e., the first 10% of swiss roll is assigned to digit 0 and so on): (c) Coding space z of the hold-out images. (d) Samples generated by walking along the main swiss roll axis.

Figure 4.4a shows the latent representation of an adversarial autoencoder with a prior that is a mixture of 10 2-D Gaussians, trained on 10K labeled MNIST examples and 40K unlabeled MNIST examples. In this case, the i-th mixture component of the prior has been assigned to the i-th class in a semi-supervised fashion. Figure 4.4b shows the manifold of the first three mixture components. Note that the style representation is consistently represented within each mixture component, independent of its class. For example, the upper-left regions of all panels in Figure 4.4b correspond to the upright writing style and the lower-right regions of these panels correspond to the tilted writing style of digits. This method may be extended to arbitrary distributions with no parametric forms – as demonstrated by mapping the MNIST dataset onto a “swiss roll” (a conditional Gaussian distribution whose mean is uniformly distributed along the length of a swiss roll axis). Figure 4.4c depicts the coding space z and Figure 4.4d highlights the images

Figure 4.5: Samples generated from an adversarial autoencoder trained on (a) MNIST (8-D Gaussian prior) and (b) the Toronto Face dataset (TFD, 15-D Gaussian prior). The last column shows the closest training images in pixel-wise Euclidean distance to those in the second-to-last column.

generated by walking along the swiss roll axis in the latent space.
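A minimal sketch of how the one-hot class switch described in this section can be appended to the discriminator input is given below; the helper only builds the conditioning vector, and its name, the code dimensionality and the use of the 11-th slot for unlabeled points are illustrative assumptions rather than the actual implementation.

```python
import numpy as np

def discriminator_input(z, label=None, n_classes=10):
    """Concatenate the hidden code z with a one-hot class switch.

    z:     (dim,) hidden code or prior sample
    label: integer class in [0, n_classes) for labeled points, or None for
           unlabeled points (the extra (n_classes + 1)-th class is used).
    """
    one_hot = np.zeros(n_classes + 1)
    one_hot[label if label is not None else n_classes] = 1.0
    return np.concatenate([one_hot, z])

print(discriminator_input(np.zeros(2), label=3).shape)   # (13,): 11 switch dims + 2-D code
print(discriminator_input(np.zeros(2), label=None))      # unlabeled: extra class turned on
```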

4.3 Likelihood Analysis of Adversarial Autoencoders

The experiments presented in the previous sections have only demonstrated qualitative results. In this section we measure the ability of the AAE as a generative model to capture the data distribution by comparing the likelihood of this model to generate hold-out images on the MNIST and Toronto face dataset (TFD) using the evaluation procedure described in [11]. We trained an adversarial autoencoder on MNIST and TFD in which the model imposed a high-dimensional Gaussian distribution on the underlying hidden code. Figure 4.5 shows samples drawn from the adversarial autoencoder trained on these datasets. A video showing the learnt TFD manifold can be found at this link. To determine whether the model is over-fitting by copying the training data points, we used the last column of these figures to show the nearest neighbors, in Euclidean distance, to the generative model samples in the second-to-last column. We evaluate the performance of the adversarial autoencoder by computing its log- likelihood on the hold out test set. Evaluation of the model using likelihood is not straightforward because we can not directly compute the probability of an image. Thus, we calculate a lower bound of the true log-likelihood using the methods described in prior Chapter 4. Adversarial Autoencoders 53

                          MNIST (10K)   MNIST (10M)   TFD (10K)    TFD (10M)
DBN [35]                  138 ± 2       -             1909 ± 66    -
Stacked CAE [84]          121 ± 1.6     -             2110 ± 50    -
Deep GSN [85]             214 ± 1.1     -             1890 ± 29    -
GAN [11]                  225 ± 2       386           2057 ± 26    -
GMMN + AE [50]            282 ± 2       -             2204 ± 20    -
Adversarial Autoencoder   340 ± 2       427           2252 ± 16    2522

Table 4.1: Log-likelihood of test data on MNIST and the Toronto Face dataset. Higher values are better. On both datasets we report the Parzen window estimate of the log-likelihood obtained by drawing 10K or 10M samples from the trained model. For MNIST, we compare against other models on the real-valued version of the dataset.

We fit a Gaussian Parzen window (kernel density estimator) to 10,000 samples generated from the model and compute the likelihood of the test data under this distribution. The free parameter σ of the Parzen window is selected via cross-validation. Table 4.1 compares the log-likelihood of the adversarial autoencoder for real-valued MNIST and TFD to many state-of-the-art methods including DBN [35], Stacked CAE [84], Deep GSN [85], Generative Adversarial Networks [11] and GMMN + AE [50]. Note that the Parzen window estimate is a lower bound on the true log-likelihood and the tightness of this bound depends on the number of samples drawn. To obtain a comparison with a tighter lower bound, we additionally report Parzen window estimates evaluated with 10 million samples for both the adversarial autoencoder and the generative adversarial network [11]. In all comparisons we find that the adversarial autoencoder achieves superior log-likelihoods to competing methods. However, the reader should be aware that the metrics currently available for evaluating the likelihood of generative models such as GANs are deeply flawed. Theis et al. [86] detail the problems with such metrics, including the 10K and 10M sample Parzen window estimates.
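For reference, the Parzen window evaluation described above can be sketched as follows. This is a minimal NumPy/SciPy illustration written for this summary, not the original evaluation script; the function and variable names (parzen_log_likelihood, select_sigma, x_valid, x_test) are made up for the example.

```python
import numpy as np
from scipy.special import logsumexp

def parzen_log_likelihood(samples, x, sigma):
    """Per-example log-likelihood of test points x under a Gaussian Parzen window
    (isotropic kernels of width sigma) fit to model samples."""
    n, d = samples.shape
    lls = []
    # evaluate in chunks so the (chunk, n_samples, d) tensor stays manageable
    for chunk in np.array_split(x, max(1, len(x) // 100)):
        diff = (chunk[:, None, :] - samples[None, :, :]) / sigma
        log_kernel = -0.5 * np.sum(diff ** 2, axis=2)               # (chunk, n)
        log_norm = np.log(n) + 0.5 * d * np.log(2 * np.pi * sigma ** 2)
        lls.append(logsumexp(log_kernel, axis=1) - log_norm)
    return np.concatenate(lls)

def select_sigma(samples, x_valid, sigmas):
    """Pick the kernel width that maximizes the validation log-likelihood."""
    return max(sigmas, key=lambda s: parzen_log_likelihood(samples, x_valid, s).mean())

# usage: samples = 10,000 images drawn from the trained model, flattened to vectors
# sigma = select_sigma(samples, x_valid, np.logspace(-1, 0, 10))
# test_ll = parzen_log_likelihood(samples, x_test, sigma).mean()
```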

4.4 Supervised Adversarial Autoencoders

Semi-supervised learning is a long-standing conceptual problem in machine learning. Recently, generative models have become one of the most popular approaches for semi-supervised learning as they can disentangle the class label information from many other latent factors of variation in a principled way [21, 87]. In this section, we first focus on the fully supervised scenario and discuss an architecture

Figure 4.6: Disentangling the label information from the hidden code by providing the one-hot vector to the generative model. The hidden code in this case learns to represent the style of the image.

of adversarial autoencoders that can separate the class label information from the image style information. We then extend this architecture to the semi-supervised setting in Section 4.5. In order to incorporate the label information, we alter the network architecture of Figure 4.1 to provide a one-hot vector encoding of the label to the decoder (Figure 4.6). The decoder utilizes both the one-hot vector identifying the label and the hidden code z

Figure 4.7: Disentangling content and style (15-D Gaussian) on the MNIST and SVHN datasets. (a) MNIST. (b) SVHN.

to reconstruct the image. This architecture forces the network to retain, in the hidden code z, all the information that is independent of the label. Figure 4.7a demonstrates the results of such a network trained on MNIST digits in which the hidden code is forced into a 15-D Gaussian. Each row of Figure 4.7a presents reconstructed images in which the hidden code z is fixed to a particular value but the label is systematically explored. Note that the style of the reconstructed images is consistent across a given row. Figure 4.7b demonstrates the same experiment applied to the Street View House Numbers dataset [75]. A video showing the learnt SVHN style manifold can be found at this link. In this experiment, the one-hot vector represents the label associated with the central digit in the image. Note that the style information in each row still contains information about the labels of the left-most and right-most digits, because those digits are not provided as label information in the one-hot encoding.

4.5 Semi-Supervised Adversarial Autoencoders

Building on the foundations from Section 4.4, we now use the adversarial autoencoder to develop models for semi-supervised learning that exploit the generative description of the unlabeled data to improve the classification performance that would be obtained by using only the labeled data. Specifically, we assume the data is generated by a latent class variable y that comes from a Categorical distribution as well as a continuous latent variable z that comes from a Gaussian distribution:

p(y) = Cat(y),    p(z) = N(z | 0, I)    (4.4)

We alter the network architecture of Figure 4.6 so that the inference network of the AAE predicts both the discrete class variable y and the continuous latent variable z using the encoder q(z, y|x) (Figure 4.8). The decoder then utilizes both the class label as a one-hot vector and the continuous hidden code z to reconstruct the image. There are two separate adversarial networks that regularize the hidden representation of the autoencoder. The first adversarial network imposes a Categorical distribution on the label representation. This adversarial network ensures that the latent class variable y does not carry any style information and that the aggregated posterior distribution of y matches the Categorical distribution. The second adversarial network imposes a Gaussian distribution on the style representation which ensures the latent variable z is a continuous Gaussian variable. Both of the adversarial networks as well as the autoencoder are trained jointly with

Figure 4.8: Semi-Supervised AAE: the top adversarial network imposes a Categorical distribution on the label representation and the bottom adversarial network imposes a Gaussian distribution on the style representation. q(y|x) is trained on the labeled data in the semi-supervised setting.

SGD in three phases: the reconstruction phase, the regularization phase and the semi-supervised classification phase. In the reconstruction phase, the autoencoder updates the encoder q(z, y|x) and the decoder to minimize the reconstruction error of the inputs on an unlabeled mini-batch. In the regularization phase, each of the adversarial networks first updates its discriminative network to tell apart the true samples (generated using the Categorical and Gaussian priors) from the generated samples (the hidden codes computed by the autoencoder). The adversarial networks then update their generators to confuse their discriminative networks. In the semi-supervised classification phase, the autoencoder updates q(y|x) to minimize the cross-entropy cost on a labeled mini-batch. The results of semi-supervised classification experiments on the MNIST and SVHN datasets are reported in Table 4.2. On the MNIST dataset with 100 and 1000 labels, the performance of AAEs is significantly better than VAEs, on par with VAT [88] and CatGAN [89], but is outperformed by the Ladder networks [90] and the ADGM [87]. We also trained a supervised AAE model on all the available labels, and obtained an error rate of 0.85%. In comparison, a dropout supervised neural network with the same architecture achieves an error rate of 1.25% on the full MNIST dataset, which highlights the regularization effect of the adversarial training. On the SVHN dataset with 1000 labels, the AAE almost matches the state-of-the-art classification performance achieved by the ADGM.

                           MNIST           MNIST            MNIST           SVHN
                           (100 labels)    (1000 labels)    (All labels)    (1000 labels)
NN Baseline                25.80           8.73             1.25            47.50
VAE (M1) + TSVM            11.82 (±0.25)   4.24 (±0.07)     -               55.33 (±0.11)
VAE (M2)                   11.97 (±1.71)   3.60 (±0.56)     -               -
VAE (M1 + M2)              3.33 (±0.14)    2.40 (±0.02)     0.96            36.02 (±0.10)
VAT                        2.33            1.36             0.64 (±0.04)    24.63
CatGAN                     1.91 (±0.1)     1.73 (±0.18)     0.91            -
Ladder Networks            1.06 (±0.37)    0.84 (±0.08)     0.57 (±0.02)    -
ADGM                       0.96 (±0.02)    -                -               16.61 (±0.24)
Adversarial Autoencoders   1.90 (±0.10)    1.60 (±0.08)     0.85 (±0.02)    17.70 (±0.30)

Table 4.2: Semi-supervised classification performance (error-rate) on MNIST and SVHN.

It is also worth mentioning that all the AAE models are trained end-to-end, whereas the semi-supervised VAE models have to be trained one layer at a time [21].
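To make the three training phases of Section 4.5 concrete, they can be organized as in the following sketch. This is a hypothetical PyTorch re-implementation written only to illustrate the phase structure (the thesis experiments use TensorFlow with the hyperparameters listed in Section 4.9.2); the layer sizes, optimizer settings and helper names are illustrative, not the original code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_classes, z_dim, x_dim = 10, 10, 784

encoder_trunk = nn.Sequential(nn.Linear(x_dim, 1000), nn.ReLU(),
                              nn.Linear(1000, 1000), nn.ReLU())
head_y = nn.Linear(1000, n_classes)                 # q(y|x): softmax label code
head_z = nn.Linear(1000, z_dim)                     # q(z|x): linear style code
decoder = nn.Sequential(nn.Linear(n_classes + z_dim, 1000), nn.ReLU(),
                        nn.Linear(1000, 1000), nn.ReLU(),
                        nn.Linear(1000, x_dim), nn.Sigmoid())
disc_y = nn.Sequential(nn.Linear(n_classes, 1000), nn.ReLU(), nn.Linear(1000, 1), nn.Sigmoid())
disc_z = nn.Sequential(nn.Linear(z_dim, 1000), nn.ReLU(), nn.Linear(1000, 1), nn.Sigmoid())

enc_params = (list(encoder_trunk.parameters()) + list(head_y.parameters())
              + list(head_z.parameters()))
opt_ae   = torch.optim.SGD(enc_params + list(decoder.parameters()), lr=0.01, momentum=0.9)
opt_gen  = torch.optim.SGD(enc_params, lr=0.1, momentum=0.1)
opt_disc = torch.optim.SGD(list(disc_y.parameters()) + list(disc_z.parameters()),
                           lr=0.1, momentum=0.1)
opt_semi = torch.optim.SGD(enc_params, lr=0.1, momentum=0.9)
bce = nn.BCELoss()

def encode(x):
    h = encoder_trunk(x)
    return F.softmax(head_y(h), dim=1), head_z(h)

def train_step(x_unlab, x_lab, labels):
    batch = x_unlab.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # (1) Reconstruction phase: update encoder and decoder on an unlabeled mini-batch.
    y, z = encode(x_unlab)
    recon = decoder(torch.cat([y, z], dim=1))
    loss_rec = 0.5 * ((recon - x_unlab) ** 2).sum(dim=1).mean()   # half the Euclidean error
    opt_ae.zero_grad(); loss_rec.backward(); opt_ae.step()

    # (2) Regularization phase: first the discriminators, then the encoder (generator).
    y, z = encode(x_unlab)
    y_prior = F.one_hot(torch.randint(n_classes, (batch,)), n_classes).float()
    z_prior = torch.randn_like(z)
    loss_disc = (bce(disc_y(y_prior), ones) + bce(disc_y(y.detach()), zeros)
                 + bce(disc_z(z_prior), ones) + bce(disc_z(z.detach()), zeros))
    opt_disc.zero_grad(); loss_disc.backward(); opt_disc.step()

    y, z = encode(x_unlab)
    loss_gen = bce(disc_y(y), ones) + bce(disc_z(z), ones)        # fool both discriminators
    opt_gen.zero_grad(); loss_gen.backward(); opt_gen.step()

    # (3) Semi-supervised classification phase: cross-entropy on a labeled mini-batch.
    loss_cls = F.cross_entropy(head_y(encoder_trunk(x_lab)), labels)
    opt_semi.zero_grad(); loss_cls.backward(); opt_semi.step()
```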

4.6 Clustering with Adversarial Autoencoders

In the previous section, we showed that with limited label information, the adversarial autoencoder is able to learn powerful semi-supervised representations. However, the question that has remained unanswered is whether equally “powerful” representations can be learnt from unlabeled data without any supervision. In this section, we show that the adversarial autoencoder can disentangle discrete class variables from the continuous latent style variables in a purely unsupervised fashion. The architecture that we use is similar to Figure 4.8, with the difference that we remove the semi-supervised classification stage and thus no longer train the network on

Figure 4.9: Unsupervised clustering of MNIST using the AAE with 16 clusters. Each row corresponds to one cluster, with the first image being the cluster head (see text).

                                         MNIST (Unsupervised)
CatGAN [89] (20 clusters)                9.70
Adversarial Autoencoder (16 clusters)    9.55 (±2.05)
Adversarial Autoencoder (30 clusters)    4.10 (±1.13)

Table 4.3: Unsupervised clustering performance (error-rate) of the AAE on MNIST.

any labeled mini-batch. Another difference is that the inference network q(y|x) predicts a one-hot vector whose dimension is the number of categories that we wish the data to be clustered into. Figure 4.9 illustrates the unsupervised clustering performance of the AAE on MNIST when the number of clusters is 16. Each row corresponds to one cluster. The first image in each row shows the cluster head, which is a digit generated by fixing the style variable to zero and setting the label variable to one of the 16 one-hot vectors. The rest of the images in each row are random test images that have been categorized into the corresponding cluster based on q(y|x). We can see that the AAE has picked up some discrete styles as the class labels. For example, the tilted digit 1s and 6s (clusters 16 and 11) are put in separate clusters from the straight 1s and 6s (clusters 15 and 10), and the network has separated digit 2s into two clusters (clusters 4 and 6) depending on whether the digit is written with a loop. We performed an experiment to evaluate the unsupervised clustering performance of AAEs, using the following evaluation protocol: once the training is done, for each cluster i, we found the validation example x_n that maximizes q(y_i|x_n), and assigned the label of x_n to all the points in cluster i. We then computed the test error based on the class labels assigned to each cluster. As shown in Table 4.3, the AAE achieves classification error rates of 9.55% and 4.10% with 16 and 30 clusters, respectively. We observed that as the number of clusters grows, the classification performance improves.
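A minimal NumPy sketch of this evaluation protocol is shown below; the array names (q_y_valid, q_y_test, etc.) are illustrative and are assumed to hold the encoder's softmax outputs and integer ground-truth labels.

```python
import numpy as np

def cluster_error_rate(q_y_valid, y_valid, q_y_test, y_test):
    """q_y_*: (n, n_clusters) outputs of q(y|x); y_*: integer ground-truth labels."""
    # each cluster i inherits the label of the validation example that maximizes q(y_i|x_n)
    cluster_labels = y_valid[np.argmax(q_y_valid, axis=0)]       # shape (n_clusters,)
    # each test point is assigned to its most probable cluster, then to that cluster's label
    predictions = cluster_labels[np.argmax(q_y_test, axis=1)]
    return float(np.mean(predictions != y_test))
```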

4.7 Dimensionality Reduction with Adversarial Autoencoders

Visualization of high dimensional data is a very important problem in many applications as it facilitates the understanding of the generative process of the data and allows us to extract useful information about the data. A popular approach to data visualization is learning a low dimensional embedding in which nearby points correspond to similar objects. Over the last decade, a large number of new non-parametric dimensionality reduction techniques such as t-SNE [91] have been proposed. The main drawback of

Figure 4.10: Dimensionality reduction with adversarial autoencoders: there are two separate adversarial networks that impose Categorical and Gaussian distributions on the latent representation. The final n dimensional representation is constructed by first mapping the one-hot label representation to an n dimensional cluster head representation and then adding the result to an n dimensional style representation. The cluster heads are learned by SGD with an additional cost function that penalizes the Euclidean distance between every two of them.

these methods is that they do not have a parametric encoder that can be used to find the embedding of new data points. Different methods such as parametric t-SNE [92] have been proposed to address this issue. Autoencoders are interesting alternatives as they provide the non-linear mapping required for such embeddings; but it is widely known that non-regularized autoencoders “fracture” the manifold into many different domains, which results in very different codes for similar images [93]. In this section, we present an adversarial autoencoder architecture for dimensionality reduction and data visualization purposes. We will show that in these autoencoders, the adversarial regularization attaches the hidden codes of similar images to each other and thus prevents the manifold fracturing problem that is typically encountered in the embeddings learnt by autoencoders. Suppose we have a dataset with m class labels and we would like to reduce the dimensionality of the dataset to n, where n is typically 2 or 3 for visualization purposes. We alter the architecture of Figure 4.8 to that of Figure 4.10, in which the final representation is constructed by adding the n dimensional distributed representation of the cluster head to the n dimensional style representation. The cluster head representation

is obtained by multiplying the m dimensional one-hot class label vector by an m × n matrix W_C, where the rows of W_C represent the m cluster head representations that are learned with SGD. We introduce an additional cost function that penalizes the Euclidean distance between every two cluster heads. Specifically, if the Euclidean distance is larger than a threshold η, the cost function is zero, and if it is smaller than η, the cost function linearly penalizes the distance.
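One plausible instantiation of this cluster-head cost is the hinge-style penalty sketched below (PyTorch). The thesis only specifies the cost up to "linearly penalizes the distance", so the exact functional form and the threshold value here are assumptions made for illustration.

```python
import torch

def cluster_head_cost(W_C, eta=1.0):
    """W_C: (m, n) matrix whose rows are the m cluster heads learned with SGD."""
    dists = torch.cdist(W_C, W_C)                        # pairwise Euclidean distances
    m = W_C.shape[0]
    off_diag = ~torch.eye(m, dtype=torch.bool)           # ignore the zero self-distances
    # zero cost for pairs farther apart than eta, linear penalty for closer pairs
    return torch.clamp(eta - dists[off_diag], min=0).sum()
```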

Figure 4.11a,b show the results of the semi-supervised dimensionality reduction in n = 2 dimensions on the MNIST dataset (m = 10) with 1000 and 100 labels. We can see that the network achieves a clean separation of the digit clusters and obtains semi-supervised classification errors of 4.20% and 6.08%, respectively. Note that because of the 2D constraint, the classification error is not as good as in the high-dimensional cases, and the style distribution of each cluster is not quite Gaussian.

Figure 4.11c shows the result of unsupervised dimensionality reduction in n = 2 dimensions where the number of clusters is chosen to be m = 20. We can see that the network achieves a rather clean separation of the digit clusters and sub-clusters. For example, the network has assigned two different clusters to digit 1 (green clusters) depending on whether the digit is straight or tilted. The network also clusters digit 6 into three clusters (black clusters) depending on how tilted the digit is, and has assigned two separate clusters to digit 2 (red clusters), depending on whether the digit is written with a loop.

This AAE architecture (Figure 4.10) can also be used to embed images into larger dimensionalities (n > 2). For example, Figure 4.11d shows the result of semi-supervised dimensionality reduction in n = 10 dimensions with 100 labels. In this case, we fixed the W_C matrix to W_C = 10I, and thus the cluster heads are the corners of a 10 dimensional simplex. The style representation is learnt to be a 10D Gaussian distribution with a standard deviation of 1 and is directly added to the cluster head to construct the final representation. Once the network is trained, in order to visualize the 10D learnt representation, we use a linear transformation to map the 10D representation to a 2D space such that the cluster heads are mapped to points that are uniformly placed on a 2D circle. We can verify from this figure that, in this high-dimensional case, the style representation has indeed learnt to have a Gaussian distribution. With 100 total labels, this model achieves a classification error rate of 3.90%, which is worse than the error rate of 1.90% achieved by the AAE architecture with the concatenated style and label representation (Figure 4.8).
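The visualization step just described can be sketched as follows: a least-squares linear map sends the fixed cluster heads W_C = 10I to points uniformly spaced on a circle, and the same map is then applied to every 10-D representation. This is a NumPy sketch with illustrative names, not the original plotting code.

```python
import numpy as np

m = 10
W_C = 10.0 * np.eye(m)                                         # fixed cluster heads (10-D)
angles = 2 * np.pi * np.arange(m) / m
targets = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # points on a 2-D circle
A, *_ = np.linalg.lstsq(W_C, targets, rcond=None)              # linear map with W_C @ A = targets

def to_2d(representation_10d):
    """Project (batch, 10) representations (cluster head + style) to 2-D for plotting."""
    return representation_10d @ A
```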

Figure 4.11: Semi-Supervised and Unsupervised Dimensionality Reduction with AAEs on MNIST. (a) 2D representation with 1000 labels (4.20% error). (b) 2D representation with 100 labels (6.08% error). (c) 2D representation learnt in an unsupervised fashion with 20 clusters (13.95% error). (d) 10D representation with 100 labels projected onto the 2D space (3.90% error).

4.8 Conclusion

In this chapter, we proposed to use the GAN framework as a variational inference algorithm for both discrete and continuous latent variables in probabilistic autoencoders. Our method, called the adversarial autoencoder (AAE), is a generative autoencoder that achieves competitive test likelihoods on the real-valued MNIST and Toronto Face datasets.

We discussed how this method can be extended to semi-supervised scenarios and showed that it achieves competitive semi-supervised classification performance on the MNIST and SVHN datasets. Finally, we demonstrated the applications of adversarial autoencoders in disentangling the style and content of images, unsupervised clustering, dimensionality reduction and data visualization.

4.9 Appendix

4.9.1 Likelihood Experiments

The encoder, decoder and discriminator each have two layers of 1000 hidden units with the ReLU activation function. The activation of the last layer of q(z|x) is linear. The weights are initialized from a Gaussian distribution with a standard deviation of 0.01. The mini-batch size is 100. The autoencoder is trained with a Euclidean cost function for reconstruction. On the MNIST dataset we use the sigmoid activation function in the last layer of the autoencoder, and on the TFD dataset we use the linear activation function. The dimensionality of the hidden code z is 8 and 15 and the standard deviation of the Gaussian prior is 5 and 10 for MNIST and TFD, respectively. On the Toronto Face dataset, the mean is subtracted from each data point and the result is divided by the standard deviation along each input dimension across the whole training set to normalize the contrast. However, after obtaining the samples, we rescaled the images (by inverting the pre-processing stage) to have pixel intensities between 0 and 1 so that we can have a fair likelihood comparison with other methods. In the deterministic case of q(z|x), the dimensionality of the hidden code should be consistent with the intrinsic dimensionality of the data, since the only source of stochasticity in q(z) is the data distribution. For example, in the case of MNIST, the dimensionality of the hidden code can be between 5 and 8, and for TFD and SVHN, it can be between 10 and 20. For training AAEs with higher dimensionalities in the code space (e.g., 1000), the probabilistic q(z|x) along with the reparametrization trick can be used.

4.9.2 Semi-Supervised Experiments

MNIST

The encoder, decoder and discriminator each have two layers of 1000 hidden units with the ReLU activation function. The last layer of the autoencoder can have a linear or sigmoid activation (sigmoid is better for sample visualization). The cost function is half the Euclidean error. The last layer of q(y|x) and q(z|x) has the softmax and linear activation function, respectively. q(y|x) and q(z|x) share the first two 1000-unit layers of the encoder. The dimensionality of both the style and label representation is 10. On the style representation, we impose a Gaussian distribution with a standard deviation of 1. On the label representation, we impose a Categorical distribution. The semi-supervised cost is a cross-entropy cost function at the output of q(y|x). We use gradient descent with momentum for optimizing all the cost functions. The momentum value for the autoencoder

reconstruction cost and the semi-supervised cost is fixed to 0.9. The momentum value for the generator and discriminator of both of the adversarial networks is fixed to 0.1. For the reconstruction cost, we use an initial learning rate of 0.01, reduce it to 0.001 after 50 epochs and to 0.0001 after 1000 epochs. For the semi-supervised cost, we use an initial learning rate of 0.1, reduce it to 0.01 after 50 epochs and to 0.001 after 1000 epochs. For both the discriminative and generative costs of the adversarial networks, we use an initial learning rate of 0.1, reduce it to 0.01 after 50 epochs and to 0.001 after 1000 epochs. We train the network for 5000 epochs. We add Gaussian noise with a standard deviation of 0.3 only to the input layer and only at training time. No dropout, ℓ2 weight decay or other Gaussian noise regularization was used in any other layer. The labeled examples were chosen at random, but we made sure they were distributed evenly across the classes. In the case of MNIST with 100 labels, the test error after the first epoch is 16.50%, after 50 epochs is 3.40%, after 500 epochs is 2.21% and after 5000 epochs is 1.80%. Batch-normalization [94] did not help in the case of the MNIST dataset.

SVHN

The SVHN dataset has about 530K training points and 26K test points. The mean is subtracted from each data point and the result is divided by the standard deviation along each input dimension across the whole training set to normalize the contrast. The dimensionality of the label representation is 10 and for the style representation we use 20 dimensions. We use gradient descent with momentum for optimizing all the cost functions. The momentum value for the autoencoder reconstruction cost and the semi-supervised cost is fixed to 0.9. The momentum value for the generator and discriminator of both of the adversarial networks is fixed to 0.1. For the reconstruction cost, we use an initial learning rate of 0.0001 and reduce it to 0.00001 after 250 epochs. For the semi-supervised cost, we use an initial learning rate of 0.1 and reduce it to 0.01 after 250 epochs. For both the discriminative and generative costs of the adversarial networks, we use an initial learning rate of 0.01 and reduce it to 0.001 after 250 epochs. We train the network for 1000 epochs. We use dropout at the input layer with a dropout rate of 20%. No other dropout, ℓ2 weight decay or Gaussian noise regularization was used in any other layer. The labeled examples were chosen at random, but we made sure they were distributed evenly across the classes. In the case of SVHN with 1000 labels, the test error after the first epoch is 49.34%, after 50 epochs is 25.86%, after 500 epochs is 18.15% and after 1000 epochs is 17.66%. Batch-normalization was used in all the autoencoder layers, including the softmax layer of q(y|x), the linear layer of q(z|x) as well as the linear output layer of the autoencoder.

We found batch-normalization [94] to be crucial in training the AAE network on the SVHN dataset.

4.9.3 Unsupervised Clustering Experiments

The encoder, decoder and discriminator each have two layers of 3000 hidden units with the ReLU activation function. The last layer of the autoencoder has a sigmoid activation function. The cost function is half the Euclidean error. The dimensionality of the style and label representation is 5 and 30 (the number of clusters), respectively. On the style representation, we impose a Gaussian distribution with a standard deviation of 1. On the label representation, we impose a Categorical distribution. We use gradient descent with momentum for optimizing all the cost functions. The momentum value for the autoencoder reconstruction cost is fixed to 0.9. The momentum value for the generator and discriminator of both of the adversarial networks is fixed to 0.1. For the reconstruction cost, we use an initial learning rate of 0.01 and reduce it to 0.001 after 50 epochs. For both the discriminative and generative costs of the adversarial networks, we use an initial learning rate of 0.1 and reduce it to 0.01 after 50 epochs. We train the network for 1500 epochs. We use dropout at the input layer with a dropout rate of 20%. No other dropout, ℓ2 weight decay or Gaussian noise regularization was used in any other layer. Batch-normalization was used only in the encoder layers of the autoencoder, including the last layer of q(y|x) and q(z|x). We found batch-normalization [94] to be crucial in training the AAE networks for unsupervised clustering.

Chapter 5

PixelGAN Autoencoders

5.1 Introduction

In recent years, generative models that can be trained via direct back-propagation have enabled remarkable progress in modeling natural images. In Chapter 1, we reviewed some of these models such as generative adversarial network (GAN) [11], variational autoencoders (VAE) [10, 37], and PixelCNNs [56]. In this chapter, we present the PixelGAN autoencoder as a probabilistic autoencoder that combines the benefits of latent variable models with autoregressive architectures. The PixelGAN autoencoder is a generative autoencoder in which the generative path is a PixelCNN that is conditioned on a latent variable. The latent variable is inferred by matching the aggregated posterior distribution to the prior distribution by an adversarial training technique similar to that of the adversarial autoencoder (Chapter 4). However, whereas in adversarial autoencoders the statistics of the data distribution are captured by the latent code, in the PixelGAN autoencoder they are captured jointly by the latent code and the autoregressive decoder. We show that imposing different distributions as the prior results in different factorizations of information between the latent code and the autoregressive decoder. For example, in Section 5.2.1, we show that by imposing a Gaussian distribution on the latent code, we can achieve a global vs. local decomposition of information. In this case, the global latent code no longer has to model all the irrelevant and fine details of the image, and can use its capacity to capture more relevant and global statistics of the image. Another type of decomposition of information that can be learnt by PixelGAN autoencoders is a discrete vs. continuous decomposition. In Section 5.2.2, we show that we can achieve this decomposition by imposing a categorical prior on the latent code using adversarial training. In this case, the categorical latent code captures the discrete underlying factors of variation in the data, such as class label information,


Figure 5.1: Architecture of the PixelGAN autoencoder.

and the autoregressive decoder captures the remaining continuous structure, such as style information, in an unsupervised fashion. We then show how PixelGAN autoencoders with categorical priors can be directly used in clustering and semi-supervised scenarios and achieve very competitive classification results on several datasets in Section 5.3. Finally, we present one of the main potential applications of PixelGAN autoencoders in learning cross-domain relations between two different domains in Section 5.4.

5.2 PixelGAN Autoencoders

Let x be a datapoint that comes from the distribution p_d(x) and z be the hidden code. The recognition path of the PixelGAN autoencoder (Figure 5.1) defines an implicit posterior distribution q(z|x) by using a deterministic neural function z = f(x, n) that takes the input x along with random noise n with a fixed distribution p(n) and outputs z. The aggregated posterior q(z) of this model is defined as follows:

q(z) = ∫ q(z|x) p_d(x) dx.    (5.1)

This parametrization of the implicit posterior distribution was originally proposed in the adversarial autoencoder work [46] as the universal approximator posterior. We can sample from this implicit distribution q(z|x) by evaluating f(x, n) at different samples of n, but the density function of this posterior distribution is intractable. Appendix 5.6.1 discusses the importance of the input noise in training PixelGAN autoencoders. The generative path p(x|z) is a conditional PixelCNN [56] that conditions on the latent vector z using an adaptive bias in the PixelCNN layers. The inference is done by an amortized GAN

inference technique that was originally proposed in the adversarial autoencoder work [46]. In this method, an adversarial network is attached on top of the hidden code vector of the autoencoder and matches the aggregated posterior distribution, q(z), to an arbitrary prior, p(z). Samples from q(z) and p(z) are provided to the adversarial network as the negative and positive examples respectively, and the generator of the adversarial network, which is also the encoder of the autoencoder, tries to match q(z) to p(z) by the gradient that comes through the discriminative adversarial network. The adversarial network, the PixelCNN decoder and the encoder are trained jointly in two phases – the reconstruction phase and the adversarial phase – executed on each mini-batch. In the reconstruction phase, the ground truth input x along with the hidden code z inferred by the encoder are provided to the PixelCNN decoder. The PixelCNN decoder weights are updated to maximize the log-likelihood of the input x. The encoder weights are also updated at this stage by the gradient that comes through the conditioning vector of the PixelCNN. In the adversarial phase, the adversarial network updates both its discriminative network and its generative network (the encoder) to match q(z) to p(z). Once the training is done, we can sample from the model by first sampling z from the prior distribution p(z), and then sampling from the conditional likelihood p(x|z) parametrized by the PixelCNN decoder. We now establish a connection between the PixelGAN autoencoder cost and maximum likelihood learning using a decomposition of the aggregated evidence lower bound (ELBO) proposed in [95]:

E_{x∼p_d(x)}[log p(x)] ≥ − E_{x∼p_d(x)}[ E_{q(z|x)}[− log p(x|z)] ] − E_{x∼p_d(x)}[ KL(q(z|x) ‖ p(z)) ]    (5.2)

                       = − E_{x∼p_d(x)}[ E_{q(z|x)}[− log p(x|z)] ] − KL(q(z) ‖ p(z)) − I(z; x)    (5.3)
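The step from Equation 5.2 to Equation 5.3 uses the standard decomposition of the average per-example KL term into a marginal KL term and a mutual information term, E_{x∼p_d(x)}[KL(q(z|x) ‖ p(z))] = KL(q(z) ‖ p(z)) + I(z; x), where I(z; x) denotes the mutual information between z and x under the joint distribution q(z|x) p_d(x) [95].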

The first term in Equation 5.3 is the reconstruction term and the second term is the marginal KL divergence between the aggregated posterior and the prior distribution. The third term is the mutual information between the latent code z and the input x. This is a regularization term that encourages z and x to be decoupled by removing the information of the data distribution from the hidden code. If the training set has N examples, I(z; x) is bounded as follows (see [95]).

0 ≤ I(z; x) ≤ log N    (5.4)

In order to maximize the ELBO, we need to minimize all three terms of Equation 5.3. We consider two cases for the decoder p(x|z):

Deterministic Decoder. If the decoder p(x|z) is deterministic or has very limited stochasticity such as the simple factorized decoder of the VAE, the mutual information term acts in the complete opposite direction of the reconstruction term. This is because the only way to minimize the reconstruction error of x is to learn a hidden code z that is relevant to x, which results in maximizing I(z; x). Indeed, it can be shown that minimizing the reconstruction term maximizes a variational lower bound on I(z; x) [96, 97]. For example, in the case of the VAE trained on MNIST, since the reconstruction is precise, the mutual information term is dominated and is close to its maximum value I(z; x) ≈ log N ≈ 11.00 nats [95]. Stochastic Decoder. If we use a powerful decoder such as the PixelCNN, the reconstruction term and the mutual information term will not compete with each other anymore and the network can minimize both independently. In this case, the optimal

solution for maximizing the ELBO would be to model p_d(x) solely by p(x|z), thereby minimizing the reconstruction term, and at the same time to minimize the mutual information term by ignoring the latent code. As a result, even though the model achieves a high likelihood, the latent code does not learn any useful representation, which is undesirable. This problem has been observed in several previous works [98, 99] and different techniques such as annealing the weight of the KL term [98] or weakening the decoder [99] have been proposed to make z and x more dependent. As suggested in [100, 99], we think that the maximum likelihood objective by itself is not a useful objective for representation learning, especially when a powerful decoder is used. In PixelGAN autoencoders, in order to encourage learning more useful representations, we modify the ELBO (Equation 5.3) by removing the mutual information term from it, since this term explicitly encourages z to become independent of x. So our cost function only includes the reconstruction term and the marginal KL term. The reconstruction term is optimized by the reconstruction phase of training and the marginal KL term is approximately optimized by the adversarial phase.¹ Note that since the mutual information term is upper bounded by a constant (log N), we are still maximizing a lower bound on the log-likelihood of the data. However, this bound is weaker than the ELBO, which is the price that is paid for learning more useful latent representations by balancing the decomposition of information between the latent code and the autoregressive decoder. For implementing the conditioning adaptive bias in the PixelCNN decoder, we explore two different architectures [56]. In the location-invariant bias, for each PixelCNN layer, we use the latent code to construct a vector that is broadcasted within each feature map

¹ The original GAN formulation optimizes the Jensen-Shannon divergence [11], but there are other formulations that optimize the KL divergence, e.g., [43].

Figure 5.2: (a) Samples of the PixelGAN autoencoder with a 2-D Gaussian code and a limited receptive field of size 9. (b) Samples of a PixelCNN with the same limited receptive field. (c) Samples of an adversarial autoencoder with a 2-D code.

of the layer and then added as an adaptive bias to that layer. In the location-dependent bias, we use the latent code to construct a spatial feature map that is broadcasted across different feature maps and then added only to the first layer of the decoder as an adaptive bias. We will discuss the effect of these architectures on the learnt representation in Figure 5.3 of Section 5.2.1 and their implementation details in Appendix 5.6.1.

5.2.1 PixelGAN Autoencoders with Gaussian Priors

Here, we show that PixelGAN autoencoders with Gaussian priors can decompose the global and local statistics of the images between the latent code and the autoregressive decoder. Figure 5.2a shows samples of a PixelGAN autoencoder model with the location-dependent bias trained on the MNIST dataset. For the purpose of better illustrating the decomposition of information, we have chosen a 2-D Gaussian latent code and limited the receptive field of the PixelGAN autoencoder to a size of 9. Figure 5.2b shows samples of a PixelCNN model with the same limited receptive field size of 9, and Figure 5.2c shows samples of an adversarial autoencoder with the 2-D Gaussian latent code. The PixelCNN can successfully capture the local statistics, but fails to capture the global statistics due to the limited receptive field size. In contrast, the adversarial autoencoder, whose sample quality is very similar to that of the VAE, can successfully capture the global statistics, but fails to generate the details of the images. However, the PixelGAN autoencoder, with the same receptive field and code size, can combine the best of both and generates sharp images with the correct global statistics. In PixelGAN autoencoders, both the PixelCNN depth and the conditioning architecture

Figure 5.3: The effect of the PixelCNN decoder depth and the conditioning architecture on the learnt representation of the PixelGAN autoencoder (shallow = 3 residual blocks, deep = 12 residual blocks). (a) Shallow PixelCNN, location-invariant bias. (b) Shallow PixelCNN, location-dependent bias. (c) Deep PixelCNN, location-invariant bias. (d) Deep PixelCNN, location-dependent bias.

affect the decomposition of information between the latent code and the autoregressive decoder. We investigate these effects in Figure 5.3 by training a PixelGAN autoencoder on MNIST where the code size is chosen to be 2 for visualization purposes. As shown in Figure 5.3a,b, when a shallow decoder is used, most of the information is encoded in the hidden code and there is a clean separation between the digit clusters. As we make the PixelCNN more powerful (Figure 5.3c,d), we can see that the hidden code is still used to capture some relevant information of the input, but the separation of digit clusters is not as sharp when the limited code size of 2 is used. In the next section, we will show that by using a larger code size (e.g., 30), we can get a much better separation of digit clusters even when a powerful PixelCNN is used. The conditioning architecture also affects the decomposition of information. In the case of the location-invariant bias, the hidden code is encouraged to learn the global information that is location-invariant (the what information and not the where information) such as the class label information. For example, we can see in Figure 5.3a,c that the network has learnt to use one of the axes of the 2D Gaussian code to explicitly encode the digit label even though a continuous prior is imposed. In this case, we can potentially get a much better separation if we impose a discrete prior. This makes this architecture suitable for the discrete vs. continuous decomposition, and we use it for our clustering and semi-supervised learning experiments. In the case of the location-dependent bias (Figure 5.3b,d), the hidden code is encouraged to learn the global information that is location dependent, such as the low-frequency content of the image, similar to what the hidden code of an adversarial or variational autoencoder would learn (Figure 5.2c). This makes this architecture suitable for the global vs. local decomposition experiments such as Figure 5.2a.

From Figure 5.3, we can see that the class label information is mostly captured by p(z) while the style information of the images is captured by both p(z) and p(x|z). This decomposition of information has also been studied in other works that combine the latent variable models with autoregressive decoders such as PixelVAE [101] and variational lossy autoencoders (VLAE) [99]. For example, the VLAE model [99] proposes to use the depth of the PixelCNN decoder to control the decomposition of information. In their model, the PixelCNN decoder is designed to have a shallow depth (small local receptive field) so that the latent code z is forced to capture more global information. This approach is very similar to our example of the PixelGAN autoencoder in Figure 5.2. However, the question that has remained unanswered is whether it is possible to achieve a complete decomposition of content and style in an unsupervised fashion, where the class label or discrete structure information is encoded in the latent code z, and the remaining continuous structure such as style is captured by a powerful and deep PixelCNN decoder. This kind of decomposition is particularly interesting as it can be directly used for clustering and semi-supervised classification. In the next section, we show that we can learn this decomposition of content and style by imposing a categorical distribution on the latent representation z using adversarial training. Note that this discrete vs. continuous decomposition is very different from the global vs. local decomposition, because a continuous factor of variation such as style can have both global and local effect on the image. Indeed, in order to achieve the discrete vs. continuous decomposition, we have to use very deep and powerful PixelCNN decoders (up to 20 residual blocks) to capture both the global and local statistics of the style by the PixelCNN while the discrete content of the image is captured by the categorical latent variable.

5.2.2 PixelGAN Autoencoders with Categorical Priors

In this section, we present an architecture of the PixelGAN autoencoder that can separate the discrete information (e.g., class label) from the continuous information (e.g., style information) in the images. We then show how our architecture can be naturally adapted to the semi-supervised setting. The architecture that we use is similar to Figure 5.1, with the difference that we impose a categorical distribution as the prior rather than the Gaussian distribution (Figure 5.4), and we also use the location-invariant bias architecture. Another difference is that we use a convolutional network as the inference network q(z|x) to encourage the encoder to preserve the content and lose the style information of the image. The inference network has a softmax output and predicts a one-hot vector whose dimension is the number of

Figure 5.4: Architecture of the PixelGAN autoencoder with the categorical prior. p(z) captures the class label and p(x|z) is a multi-modal distribution that captures the style distribution of a digit conditioned on the class label of that digit.

discrete labels or categories that we wish the data to be clustered into. The adversarial network is trained directly on the continuous probability outputs of the softmax layer of the encoder. Imposing a categorical distribution at the output of the encoder imposes two constraints. The first constraint is that the encoder has to make confident decisions about the class labels of the inputs. The adversarial training pushes the output of the encoder to the corners of the softmax simplex, by which it ensures that the autoencoder cannot use the latent vector z to carry any continuous style information. The second constraint imposed by adversarial training is that the aggregated posterior distribution of z should match the categorical prior distribution with uniform outcome probabilities. This constraint enforces the encoder to evenly distribute the class labels across the corners of the softmax simplex. Because of these constraints, the latent variable will only capture the discrete content of the image and all the continuous style information will be captured by the autoregressive decoder. In order to better understand and visualize the effect of the adversarial training on shaping the hidden code distribution, we train a PixelGAN autoencoder on the first three digits of MNIST (18000 training and 3000 test points) and choose the number of

clusters to be 3. Suppose z = [z1, z2, z3] is the hidden code, which in this case is the vector of output probabilities of the softmax layer of the inference network. In Figure 5.5a, we project the 3D softmax simplex z1 + z2 + z3 = 1 onto a 2D triangle and plot the hidden codes of the training examples when no distribution is imposed on the hidden code. We can see from this figure that the network has learnt to use the surface of the softmax simplex to encode style information of the digits, and thus the three corners of the simplex do not

Figure 5.5: Effect of GAN regularization (categorical prior) on the code space of PixelGAN autoencoders. (a) Without GAN regularization. (b) With GAN regularization.

have any meaningful interpretation. Figure 5.5b corresponds to the code space of the same network when a categorical distribution is imposed using adversarial training. In this case, we can see that the network has successfully learnt to encode the label information of the three digits in the three corners of the simplex, and all the style information has been separately captured by the autoregressive decoder. This network achieves an almost perfect test error rate of 0.3% on the first three digits of MNIST, even though it is trained in a purely unsupervised fashion. Once the PixelGAN autoencoder is trained, its encoder can be used for clustering new points and its decoder can be used to generate samples from each cluster. Figure 5.6 illustrates samples of the PixelGAN autoencoder trained on the full MNIST dataset. The number of clusters is set to 30 and each row corresponds to the conditional samples of one of the clusters (only 16 are shown). We can see that the discrete latent code of the network has learnt discrete factors of variation such as class label information and some discrete style information. For example, digit 1s are put in different clusters based on how tilted they are. The network also assigns different clusters to digit 2s (based on whether they have a loop) and digit 7s (based on whether they have a dash in the middle). In Section 5.3.1, we will show that by using the encoder of this network, we can obtain an error rate of about 5% in classifying digits in an unsupervised fashion, just by matching each cluster to a digit type.

Semi-Supervised PixelGAN Autoencoders. The PixelGAN autoencoder can be used in a semi-supervised setting. In order to incorporate the label information, we add a semi-supervised training phase. Specifically, we set the number of clusters to be the same

Figure 5.6: Disentangling the content and style in an unsupervised fashion with PixelGAN autoencoders. Each row shows samples of the model from one of the learnt clusters.

as the number of class labels and after executing the reconstruction and the adversarial phases on an unlabeled mini-batch, the semi-supervised phase is executed on a labeled mini-batch, by updating the weights of the encoder q(z|x) to minimize the cross-entropy cost. The semi-supervised cost also reduces the mode-missing behavior of the GAN training by enforcing the encoder to learn all the modes of the categorical distribution. In Section 5.3.2, we will evaluate the performance of the PixelGAN autoencoders on the semi-supervised classification tasks.

5.3 Experiments

In this chapter, we presented the PixelGAN autoencoder as a generative model, but the currently available metrics for evaluating the likelihood of GAN-based generative models such as Parzen window estimate are fundamentally flawed [86]. So in this section, we only present the performance of the PixelGAN autoencoder on downstream tasks such as unsupervised clustering and semi-supervised classification. The details of all the experiments can be found in Appendix 5.6.2.

5.3.1 Unsupervised Clustering

We trained a PixelGAN autoencoder in an unsupervised fashion on the MNIST dataset (Figure 5.6). We chose the number of clusters to be 30 and used the following evaluation protocol: once the training is done, for each cluster i, we found the validation example x_n that maximizes q(z_i|x_n), and assigned the label of x_n to all the points in cluster i. We then computed the test error based on the class labels assigned to each cluster. As shown in the first column of Table 5.1, the performance of PixelGAN autoencoders is on

par with other GAN-based clustering algorithms such as CatGAN [89], InfoGAN [97] and adversarial autoencoders (Chapter 4).

5.3.2 Semi-supervised Classification

Table 5.1, Table 5.2 and Figure 5.8 report the results of semi-supervised classification experiments on the MNIST, SVHN and NORB datasets. On the MNIST dataset with 20, 50 and 100 labels, our classification results are highly competitive. Note that the classification rate of unsupervised clustering of MNIST is better than semi-supervised MNIST with 20 labels. This is because in the unsupervised case, the number of clusters is 30, but in the semi-supervised case, there are only 10 class labels which makes it more likely to confuse two digits. On the SVHN dataset with 500 and 1000 labels, the PixelGAN autoencoder outperforms all the other methods except the recently proposed temporal ensembling work [102] which is not a generative model. On the NORB dataset with 1000 labels, the PixelGAN autoencoder outperforms all the other reported results. Figure 5.7 shows the conditional samples of the semi-supervised PixelGAN autoencoder on the MNIST, SVHN and NORB datasets. Each column of this figure presents sampled images conditioned on a fixed one-hot latent code. We can see from this figure that the PixelGAN autoencoder can achieve a rather clean separation of style and content on these datasets with very few labeled data.

Figure 5.7: Conditional samples of the semi-supervised PixelGAN autoencoder. (a) SVHN (1000 labels). (b) MNIST (100 labels). (c) NORB (1000 labels).

Figure 5.8: Semi-supervised error rate of PixelGAN autoencoders on the MNIST and SVHN datasets. Left panel: error rate vs. training epochs on MNIST with 100, 50 and 20 labels, and for unsupervised clustering with 30 clusters. Right panel: error rate vs. training epochs on SVHN with 1000 and 500 labels.

                              MNIST            MNIST           MNIST           MNIST
                              (Unsupervised)   (20 labels)     (50 labels)     (100 labels)
VAE [21]                      -                -               -               3.33 (±0.14)
VAT [88]                      -                -               -               2.33
ADGM [87]                     -                -               -               0.96 (±0.02)
SDGM [87]                     -                -               -               1.32 (±0.07)
Adversarial Autoencoder [46]  4.10 (±1.13)     -               -               1.90 (±0.10)
Ladder Networks [90]          -                -               -               0.89 (±0.50)
Convolutional CatGAN [89]     4.27             -               -               1.39 (±0.28)
InfoGAN [97]                  5.00             -               -               -
Feature Matching GAN [103]    -                16.77 (±4.52)   2.21 (±1.36)    0.93 (±0.06)
PixelGAN Autoencoders         5.27 (±1.81)     12.08 (±5.50)   1.16 (±0.17)    1.08 (±0.15)

Table 5.1: Semi-supervised learning and clustering error-rate on the MNIST dataset.

                              SVHN            SVHN            NORB
                              (500 labels)    (1000 labels)   (1000 labels)
VAE [21]                      -               36.02 (±0.10)   18.79 (±0.05)
VAT [88]                      -               24.63           9.88
ADGM [87]                     -               22.86           10.06 (±0.05)
SDGM [87]                     -               16.61 (±0.24)   9.40 (±0.04)
Adversarial Autoencoder [46]  -               17.70 (±0.30)   -
Feature Matching GAN [103]    18.44 (±4.80)   8.11 (±1.30)    -
Temporal Ensembling [102]     7.05 (±0.30)    5.43 (±0.25)    -
PixelGAN Autoencoders         10.47 (±1.80)   6.96 (±0.55)    8.90 (±1.0)

Table 5.2: Semi-supervised learning error-rate on the SVHN and NORB datasets.

5.4 Learning Cross-Domain Relations with PixelGAN Autoencoders

In this section, we discuss how the PixelGAN autoencoder can be viewed in the context of learning cross-domain relations between two different domains. We also describe how the problem of clustering or semi-supervised learning can be cast as the problem of finding a smooth cross-domain mapping from the data distribution to the categorical distribution. Recently several GAN-based methods have been developed to learn a cross-domain mapping between two different domains [104, 105, 106, 46, 107]. In [106], an unsupervised cost function called the output distribution matching (ODM) is proposed to find a

cross-domain mapping F between two domains D1 and D2 by imposing the following unsupervised constraint on the uncorrelated samples from x ∼ D1 and y ∼ D2:

Distr[F(x)] = Distr[y]    (5.5)

where Distr[z] denotes the distribution of the random variable z. Adversarial training is proposed as one of the methods for matching these distributions. If we have access to a few labeled pairs (x, y), then F can be further trained on them in a supervised fashion to satisfy F(x) = y. For example, in speech recognition, we want to find a cross-domain mapping from a sequence of phonemes to a sequence of characters. By optimizing the ODM cost function in Equation 5.5, we can find a smooth function F that takes phonemes at its input and outputs a sequence of characters that respects the language model. However, the main problem with this method is that the network can learn to ignore part of the input distribution and still satisfy the ODM cost function with its output distribution. This problem has also been observed in other works such as [104]. One way to avoid this problem is to add a reconstruction term to the ODM cost function by introducing a reverse mapping from the output of the encoder to the input domain. This is essentially the idea of the adversarial autoencoder (Chapter 4), which learns a generative model by finding a cross-domain mapping between a Gaussian distribution and the data distribution. Using the ODM cost function along with a reconstruction term to learn cross-domain relations has been explored in several previous works. For example, InfoGAN [97] adds a mutual information term to the ODM cost function and optimizes a variational lower bound on this term. It can be shown that maximizing this variational bound is indeed minimizing the reconstruction cost of an autoencoder [96]. Similarly, in [107, 108], an adversarial autoencoder is used to learn the cross-domain relations of the vector representations of words from two different languages. The architecture of the

Figure 5.9: Cross-domain adaptation with (a) output distribution matching [106] and (b) adversarial autoencoders [46].

recent works of DiscoGAN [104] and CycleGAN [105] is also similar to an adversarial autoencoder in which the latent representation is enforced to have the distribution of the other domain. Here we describe how our proposed PixelGAN autoencoder can potentially be used in all these application areas to learn better cross-domain relations.

Suppose we want to learn a mapping from domain D1 to D2. In the architecture of Figure 5.1, we can use independent samples of x ∼ D1 at the input and instead of imposing a Gaussian distribution on the latent code, we can impose the distribution of the

second domain using its independent samples y ∼ D2. Unlike adversarial autoencoders, the encoder of PixelGAN autoencoders does not have to retain all the input information in order to have a lossless reconstruction. So the encoder can use all its capacity to learn

the most relevant mapping from D1 to D2 and at the same time, the PixelCNN decoder can capture the remaining information that has been lost by the encoder.

We can adopt the ODM idea for semi-supervised learning by assuming D1 is the image domain and D2 is the label domain (Figure 5.9a). Independent samples of D1 and D2

correspond to samples from the data distribution p_d(x) and the categorical distribution. The function F = q(y|x) can be parametrized by a neural network that is trained to satisfy the ODM cost function by matching the aggregated distribution q(y) = ∫ q(y|x) p_d(x) dx to the categorical distribution using adversarial training. The few labeled examples are used to further train F to satisfy F(x) = y. However, as explained above, the problem with this method is that the network can learn to generate the categorical distribution by ignoring some part of the input distribution. The adversarial autoencoder (Figure 5.9b)

solves this problem by adding an inverse mapping from the categorical distribution to the data distribution. However, the main drawback of the adversarial autoencoder architecture is that due to the reconstruction term, the latent representation now has to model all the underlying factors of variation in the image. For example, in the architecture of Figure 5.9b, while we are only interested in the one-hot label representation to do semi-supervised learning, we also need to infer the style of the image so that we can have a lossless reconstruction of the image. The PixelGAN autoencoder solves this problem by enabling the encoder to only infer the factor of variation that we are interested in (i.e., label information), while the remaining structure of the input (i.e., style information) is automatically captured by the autoregressive decoder.

5.5 Conclusion

In this chapter, we proposed the PixelGAN autoencoder, which is a generative autoencoder that combines a generative PixelCNN with a GAN inference network that can impose arbitrary priors on the latent code. We showed that imposing different distributions as the prior enables us to learn a latent representation that captures the type of statistics that we care about, while the remaining structure of the image is captured by the PixelCNN decoder. Specifically, by imposing a Gaussian prior, we were able to disentangle the low-frequency and high-frequency statistics of the images, and by imposing a categorical prior we were able to disentangle the style and content of images and learn representations that are specifically useful for clustering and semi-supervised learning tasks. While the main focus of this chapter was to demonstrate the application of PixelGAN autoencoders in downstream tasks such as semi-supervised learning, we discussed how these architectures have many other potential applications, such as learning cross-domain relations between two different domains.

5.6 Appendix

5.6.1 Implementation Details

In this section, we describe two important architecture design choices for training PixelGAN autoencoders.

Input noise

In all the semi-supervised experiments, we found it crucial to use the universal approximator posterior discussed in Section 5.2, as opposed to a deterministic posterior. Specifically, the input noise that we use is an additive Gaussian noise, which results in a posterior distribution q(z|x) that is more expressive than that of a model without the input corruption. This is similar to the denoising criterion proposed in [109]. We believe this additive noise also plays an important role in preventing the mode-missing behavior of the GAN when imposing a degenerate distribution such as the categorical distribution. Related ideas have been used to stabilize GAN training, such as instance noise [110] and one-sided label smoothing [103].
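A minimal sketch of this idea, with a placeholder linear encoder standing in for the real network: because the Gaussian noise is added at the input and then passed through the encoder, repeated encodings of the same x yield different codes, i.e., q(z|x) becomes a stochastic, implicit posterior rather than a point mass.

```python
import numpy as np

rng = np.random.default_rng(0)

def deterministic_encoder(x):
    # Placeholder for the encoder network; any deterministic mapping works here.
    w = np.ones((x.shape[-1], 2)) * 0.05
    return x @ w

def sample_posterior(x, noise_std=0.3, n_samples=5):
    """Draw samples from the implicit posterior q(z|x) induced by input noise.

    Because the noise is injected at the input and then passed through the
    (in general nonlinear) encoder, q(z|x) is not restricted to a Gaussian.
    """
    samples = []
    for _ in range(n_samples):
        eps = noise_std * rng.standard_normal(x.shape)
        samples.append(deterministic_encoder(x + eps))
    return np.stack(samples)

x = rng.standard_normal((1, 784))
print(sample_posterior(x).squeeze())   # different z for the same x
```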

Conditioning of PixelCNN

There are three methods to implement how the PixelCNN conditions on the latent vector.

Location-Invariant Bias. This is the method that was proposed in the conditional PixelCNN model [56]. Suppose the size of the convolutional layer of the decoder is (batch, width, height, channels). The PixelCNN can then use a linear mapping to convert the conditioning tensor of size (batch, condition_size) to a tensor of size (batch, channels) that is broadcasted and added to the feature maps of all the layers of the PixelCNN decoder as an adaptive bias. In this method, the hidden code is encouraged to learn the global information that is location-invariant (the "what" information and not the "where" information), such as the class label. We use this method in all the clustering and semi-supervised learning experiments.

Location-Dependent Bias. Suppose the size of the convolutional layer of the PixelCNN decoder is (batch, width, height, channels). The PixelCNN can then use a one-layer neural network to convert the conditioning tensor of size (batch, condition_size) to a spatial tensor of size (batch, width, height, k), followed by a 1 × 1 convolutional layer that constructs a tensor of size (batch, width, height, channels), which is added only to the feature maps of the first layer of the decoder as an adaptive bias (similar to the VPN model [111]). When k = 1, we can simply broadcast the tensor of size (batch, width, height, k=1) to get a tensor of size (batch, width, height, channels) instead of using the 1 × 1 convolution. In this method, the latent vector carries spatial, location-dependent information within the feature map. This is the method that we used in the experiments of Figure 5.2a.

Input Channel. Another method for conditioning is proposed in the PixelVAE [101] and the variational lossy autoencoder (VLAE) [99]. In this method, a tensor of size (batch, width, height, k) is first constructed from the conditioning tensor, as in the location-dependent bias. This tensor is then concatenated to the input of the PixelCNN. The performance and computational complexity of this method are very similar to those of the location-dependent bias method.
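The following NumPy sketch illustrates the tensor shapes involved in the first two conditioning schemes; the weights are random placeholders and the broadcasting stands in for the additions performed inside the network.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, width, height, channels, condition_size, k = 16, 28, 28, 32, 30, 1

feature_maps = rng.standard_normal((batch, width, height, channels))
h = rng.standard_normal((batch, condition_size))        # latent code from the encoder

# --- Location-invariant bias (conditional PixelCNN style) -------------------
W_inv = rng.standard_normal((condition_size, channels)) * 0.01
bias = h @ W_inv                                         # (batch, channels)
conditioned = feature_maps + bias[:, None, None, :]      # broadcast over width and height

# --- Location-dependent bias (VPN-style, k = 1 case) ------------------------
W_dep = rng.standard_normal((condition_size, width * height * k)) * 0.01
spatial = np.maximum(h @ W_dep, 0.0)                     # one-layer net with ReLU
spatial = spatial.reshape(batch, width, height, k)
# With k = 1 we broadcast across channels instead of applying a 1x1 convolution;
# this bias is added only to the first layer of the decoder.
conditioned_first_layer = feature_maps + spatial

print(conditioned.shape, conditioned_first_layer.shape)
```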

5.6.2 Experiment Details

We used TensorFlow [112] in all of our experiments. As suggested in [11], in order to improve the stability of GAN training, the generator of the GAN in all our experiments is trained to maximize log D(G(z)) rather than to minimize log(1 − D(G(z))).
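The difference between the two generator objectives can be seen with a few lines of NumPy (an illustration only; in practice the losses are computed and differentiated by the framework). Early in training the discriminator assigns small values to generated samples, where log(1 − D(G(z))) saturates while −log D(G(z)) still provides a strong training signal.

```python
import numpy as np

def generator_loss_saturating(d_fake):
    """Original minimax generator loss: minimize log(1 - D(G(z)))."""
    return np.mean(np.log(1.0 - d_fake))

def generator_loss_nonsaturating(d_fake):
    """Heuristic loss used here: maximize log D(G(z)), i.e. minimize -log D(G(z))."""
    return -np.mean(np.log(d_fake))

# d_fake are discriminator outputs on generated samples, in (0, 1).
d_fake = np.array([0.01, 0.02, 0.05])        # early in training D easily rejects fakes
print(generator_loss_saturating(d_fake))     # near zero; gradients w.r.t. D(G(z)) are tiny
print(generator_loss_nonsaturating(d_fake))  # large; much stronger learning signal
```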

MNIST Dataset

The MNIST dataset has 50K training points, 10K validation points and 10K test points. We perform experiments on both the binary MNIST and the real-valued MNIST. In the real-valued MNIST experiments, we subtract 127.5 from the data points and then divide them by 127.5, and use the discretized logistic mixture likelihood [113] as the cost function for the PixelCNN. In the case of binary MNIST, the data points are binarized by setting pixel values larger than 0.5 to 1, and values smaller than 0.5 to 0.

PixelGAN Autoencoders with Gaussian Prior on MNIST. Here we describe the model architecture used for training the PixelGAN autoencoder with a Gaussian prior on the binary MNIST dataset in Figure 5.2a. The PixelCNN decoder uses both the vertical and horizontal stacks similar to [56]. The cost function of the PixelCNN is the cross-entropy cost function. The PixelCNN uses the location-dependent bias as described in Appendix 5.6.1. Specifically, a tensor of size (batch, width, height, 1) is constructed from the conditioning vector by using a one-layer neural network with 1000 hidden units, ReLU activation and linear output. This tensor is then broadcasted and added only to the feature maps of the first layer of the PixelCNN decoder. The PixelCNN is designed to have a local receptive field by having 3 residual blocks (filter size of 3x5, 32 feature maps, ReLU non-linearity as in [56]). The adversarial discriminator has two layers of 2000 hidden units with ReLU activation function. The encoder architecture has two fully-connected layers of size 2000 with ReLU non-linearity. The last layer of the encoder q(z|x) has a linear activation function. On the latent representation of size 2, we impose a Gaussian distribution with standard deviation of 5. We used the gradient descent with momentum algorithm for optimizing all the cost functions of the network. For the PixelCNN reconstruction cost, we used a learning rate of 0.001 and a momentum value of 0.9. After 25 epochs we reduce the learning rate to 0.0001. For both the generator and the discriminator costs, the learning rates and the momentum values were set to 0.1.

Unsupervised Clustering of MNIST. Here we describe the model architecture used for clustering the binary MNIST dataset in Figure 5.6 and Section 5.3.1. The PixelCNN decoder uses both the vertical and horizontal stacks similar to [56]. The cost function of the PixelCNN is the cross-entropy cost function. The PixelCNN uses the location-invariant bias as described in Appendix 5.6.1 and has 15 residual blocks (filter size of 3x5, 32 feature maps, ReLU non-linearity as in [56]). The adversarial discriminator has two layers of 3000 hidden units with ReLU activation function. The encoder architecture has a convolutional layer (filter size of 7, 32 feature maps, ReLU activation) and a max-pooling layer (pooling size 2), followed by another convolutional layer (filter size of 7, 32 feature maps, ReLU activation) and a max-pooling layer (pooling size 2), with no fully-connected layer. The last layer of the encoder q(z|x) has the softmax activation function. We found it important to use batch-normalization [94] for all the layers of the encoder including the softmax layer. The number of clusters is chosen to be 30. The clusters are represented by a discrete one-hot variable of size 30. On the continuous probability output of the softmax, we impose a categorical distribution with uniform probabilities. We use the Adam [114] optimizer with a learning rate of 0.001 for optimizing the PixelCNN reconstruction cost function, but we found it important to use the gradient descent with momentum algorithm for optimizing the generator and the discriminator costs of the adversarial network. For both the generator and the discriminator costs, the momentum values were set to 0.1 and the learning rates were set to 0.01. We use an input dropout noise with a keep probability of 0.8 at the input layer, and only at training time. The model architecture used for Figure 5.5 is the same as this architecture, except that the number of clusters is chosen to be 3.

Semi-Supervised MNIST. We performed semi-supervised learning experiments on both the binary and the real-valued MNIST dataset. We found that the semi-supervised error rate of the real-valued MNIST is roughly the same as that of the binary MNIST (about 1.10% with 100 labels), but it takes longer to train due to the logistic mixture likelihood cost function [113]. So in Table 5.1, we only report the performance on the binary MNIST, but in Figure 5.7b we show samples of the real-valued MNIST with 100 labels.

Semi-Supervised Binary MNIST. Here we describe the model architecture used for the semi-supervised learning experiments on the binary MNIST in Section 5.3.2 and Table 5.1. The PixelCNN decoder uses both the vertical and horizontal stacks similar to [56] and uses the cross-entropy cost function. The PixelCNN uses the location-invariant bias as described in Appendix 5.6.1. The PixelCNN has 6 residual blocks (filter size of 3x5, 32 feature maps, ReLU non-linearity as in [56]). The adversarial discriminator has two layers of 1000 hidden units with ReLU activation function. The encoder architecture has three convolutional layers (filter size of 5, 32 feature maps, ReLU activation) and a max-pooling layer (pooling size 2), followed by another three convolutional layers (filter size of 5, 32 feature maps, ReLU activation) and a max-pooling layer (pooling size 2), with no fully-connected layer. The last layer of the encoder q(z|x) has the softmax activation function. All the convolutional layers of the encoder except the softmax layer use batch-normalization [94]. On the latent representation, we impose a categorical distribution with uniform probabilities. The semi-supervised cost is the cross-entropy cost function at the output of q(z|x). We use the Adam [114] optimizer with a learning rate of 0.001 for optimizing the PixelCNN cost and the cross-entropy cost, but we found it important to use the gradient descent with momentum algorithm for optimizing the generator and the discriminator costs of the adversarial network. For both the generator and the discriminator costs, the momentum values were set to 0.1 and the learning rates were set to 0.1. We add a Gaussian noise with standard deviation of 0.3 to the input layer as described in Appendix 5.6.1. The labeled examples were chosen at random but evenly distributed across the classes.

Semi-Supervised Real-valued MNIST. Here we describe the model architecture used for the semi-supervised learning experiments on the real-valued MNIST in Figure 5.7b. The PixelCNN decoder uses both the vertical and horizontal stacks similar to [56] and uses a discretized logistic mixture likelihood cost function with 10 logistic distributions as proposed in [113]. The PixelCNN uses the location-invariant bias as described in Appendix 5.6.1. The PixelCNN has 20 residual blocks (filter size of 2x3, 64 feature maps, gated sigmoid-tanh non-linearity as in [56]). The adversarial discriminator has two layers of 1000 hidden units with ReLU activation function. The encoder architecture has three convolutional layers (filter size of 5, 32 feature maps, ReLU activation) and a max-pooling layer (pooling size 2), followed by another three convolutional layers (filter size of 5, 32 feature maps, ReLU activation) and a max-pooling layer (pooling size 2), with no fully-connected layer. The last layer of the encoder q(z|x) has the softmax activation function. All the convolutional layers of the encoder except the softmax layer use batch-normalization [94]. On the latent representation, we impose a categorical distribution with uniform probabilities. The semi-supervised cost is the cross-entropy cost function at the output of q(z|x). We use the Adam [114] optimizer with a learning rate of 0.001 for optimizing the PixelCNN cost and the cross-entropy cost, but we found it important to use the gradient descent with momentum algorithm for optimizing the generator and the discriminator costs of the adversarial network. For both the generator and the discriminator costs, the momentum values were set to 0.1 and the learning rates were set to 0.1. After 150 epochs, we divide all the learning rates by 10. We add a Gaussian noise with standard deviation of 0.3 to the input layer as described in Appendix 5.6.1. The labeled examples were chosen at random but evenly distributed across the classes.
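As a concrete reference for the preprocessing described above, here is a minimal NumPy sketch of the two input pipelines (the function names and the toy random input are ours): real-valued MNIST is scaled to [-1, 1] via (x − 127.5)/127.5, and binary MNIST is thresholded at 0.5.

```python
import numpy as np

def preprocess_real_valued(x_uint8):
    """Scale pixel values to [-1, 1] as described above: (x - 127.5) / 127.5."""
    return (x_uint8.astype(np.float32) - 127.5) / 127.5

def binarize(x_unit_interval, threshold=0.5):
    """Binary MNIST: pixels above the threshold become 1, the rest become 0."""
    return (x_unit_interval > threshold).astype(np.float32)

x = np.random.default_rng(0).integers(0, 256, size=(2, 28, 28), dtype=np.uint8)
print(preprocess_real_valued(x).min(), preprocess_real_valued(x).max())
print(np.unique(binarize(x / 255.0)))
```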

SVHN Dataset

The SVHN dataset has about 530K training points and 26K test points. We use 10K points for the validation set. Similar to [88], we downsample the images from 32 × 32 × 3 to 16 × 16 × 3, and then subtract 127.5 from the data points and divide them by 127.5.

Semi-Supervised SVHN. Here we describe the model architecture used for the semi-supervised learning experiments on the SVHN dataset in Section 5.3.2. The PixelCNN decoder uses both the vertical and horizontal stacks similar to [56]. The cost function of the PixelCNN is a discretized logistic mixture likelihood cost function with 10 logistic distributions as proposed in [113]. The PixelCNN uses the location-invariant bias as described in Appendix 5.6.1 and has 20 residual blocks (filter size of 3x5, 32 feature maps, gated sigmoid-tanh non-linearity as in [56]). The adversarial discriminator has two layers of 1000 hidden units with ReLU activation function. The encoder architecture has two convolutional layers (filter size of 5, 32 feature maps, ReLU activation) and a max-pooling layer (pooling size 2), followed by another two convolutional layers (filter size of 5, 32 feature maps, ReLU activation) and a max-pooling layer (pooling size 2), with no fully-connected layer. The last layer of the encoder q(z|x) has the softmax activation function. All the convolutional layers of the encoder except the softmax layer use batch-normalization [94]. On the latent representation, we impose a categorical distribution with uniform probabilities. The semi-supervised cost is the cross-entropy cost function at the output of q(z|x). We use the Adam [114] optimizer for optimizing all the cost functions. For the PixelCNN cost and the cross-entropy cost we use a learning rate of 0.001, and for the generator and the discriminator costs of the adversarial network we use a learning rate of 0.0001. We add a Gaussian noise with standard deviation of 0.2 to the input layer as described in Appendix 5.6.1.
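For convenience, the semi-supervised SVHN hyperparameters listed above can be collected into a single configuration object. The sketch below is only a summary of the text; the dictionary keys are our own naming and are not taken from any released code.

```python
# Hypothetical configuration summarizing the semi-supervised SVHN setup described above.
svhn_semi_supervised_config = {
    "input_size": (16, 16, 3),                 # images downsampled from 32x32x3
    "pixelcnn": {
        "residual_blocks": 20,
        "filter_size": (3, 5),
        "feature_maps": 32,
        "nonlinearity": "gated sigmoid-tanh",
        "loss": "discretized logistic mixture (10 components)",
        "conditioning": "location-invariant bias",
    },
    "encoder": {
        # Two repetitions of: 2 x (conv 5x5, 32 maps, ReLU) followed by 2x2 max-pooling.
        "blocks": 2,
        "convs_per_block": 2,
        "filter_size": 5,
        "feature_maps": 32,
        "output": "softmax with batch normalization",
    },
    "discriminator_hidden": [1000, 1000],
    "optimizer": "Adam",
    "learning_rates": {"pixelcnn_and_cross_entropy": 1e-3, "adversarial": 1e-4},
    "input_noise_std": 0.2,
    "latent_prior": "uniform categorical",
}
```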

NORB Dataset

The NORB dataset has about 24K training points and 24K test points. We use 4K points for the validation set. This dataset has 5 object categories: animals, human figures, airplanes, trucks and cars. We downsample the images to the size of 32 × 32 × 1, subtract 127.5 from the data points and then divide them by 127.5.

Semi-Supervised NORB. The PixelCNN decoder uses both the vertical and horizontal stacks similar to [56]. The cost function of the PixelCNN is a discretized logistic mixture likelihood cost function with 10 logistic distributions as proposed in [113]. The PixelCNN uses the location-invariant bias as described in Appendix 5.6.1 and has 15 residual blocks (filter size of 3x5, 32 feature maps, gated sigmoid-tanh non-linearity as in [56]). The adversarial discriminator has two layers of 1000 hidden units with ReLU activation function. The encoder architecture has a convolutional layer (filter size of 7, 32 feature maps, ReLU activation) and a max-pooling layer (pooling size 2), followed by another convolutional layer (filter size of 7, 32 feature maps, ReLU activation) and a max-pooling layer (pooling size 2), followed by another convolutional layer (filter size of 7, 32 feature maps, ReLU activation) and a max-pooling layer (pooling size 2), with no fully-connected layer. The last layer of the encoder q(z|x) has the softmax activation function. All the convolutional layers of the encoder except the softmax layer use batch-normalization [94]. On the latent representation, we impose a categorical distribution with uniform probabilities. The semi-supervised cost is the cross-entropy cost function at the output of q(z|x). We use the Adam [114] optimizer for optimizing all the cost functions. For the PixelCNN cost and the cross-entropy cost we use a learning rate of 0.001, and for the generator and the discriminator costs of the adversarial network we use a learning rate of 0.0001. We add a Gaussian noise with standard deviation of 0.3 to the input layer as described in Appendix 5.6.1. The labeled examples were chosen at random but evenly distributed across the classes. In the case of NORB with 1000 labels, the test error after 10 epochs is 12.97%, after 100 epochs is 11.63%, and after 500 epochs is 8.17%.

Chapter 6

Conclusions

6.1 Summary of Contributions

In this thesis, we studied the problem of learning unsupervised representations using autoencoders, and proposed regularization techniques that enable autoencoders to learn useful representations of data in unsupervised and semi-supervised settings. In Chapter 2 and Chapter 3, we proposed regularization techniques that exploit sparsity as a generic prior on the representations. In particular, in Chapter 2, we proposed the k-sparse autoencoder as a very fast sparse coding method that can achieve exact sparsity in the hidden representation. We investigated the effectiveness of sparsity by itself and showed that by solely enforcing sparsity in the hidden units, and without using any other nonlinearity or regularization, we can learn unsupervised representations that achieve competitive classification results. In Chapter 3, we proposed another sparse coding algorithm called the winner-take-all autoencoder, which is a sparse autoencoder that uses winner-take-all spatial and lifetime sparsity operators to learn fully-connected and convolutional sparse representations. We observed that convolutional winner-take-all autoencoders learn shift-invariant and diverse dictionary atoms, as opposed to the position-specific Gabor-like atoms that are typically learnt by conventional sparse coding methods. In Chapter 4 and Chapter 5, we proposed to regularize autoencoders using a generative adversarial network that matches the distribution of the latent variable of the autoencoder with a pre-defined prior. In particular, in Chapter 4, we proposed the adversarial autoencoder, which is a generative autoencoder that uses the GAN framework as a variational inference method for both discrete and continuous latent variables. We showed that adversarial autoencoders can achieve competitive results in generative modeling and semi-supervised classification tasks. In Chapter 5, we proposed the PixelGAN autoencoder, which is a generative autoencoder in which the generative path is a conditional PixelCNN that conditions on a latent vector, and the recognition path, similar to the adversarial autoencoder, uses a GAN inference network to impose an arbitrary prior distribution on the latent code. We showed that imposing different distributions as the prior enables us to learn a latent representation that captures the type of statistics that we care about, while the remaining structure of the image is captured by the PixelCNN decoder. Specifically, by imposing a Gaussian prior, we were able to disentangle the low-frequency and high-frequency statistics of the images, and by imposing a categorical prior we were able to disentangle the style and content of images and learn representations that are specifically useful for clustering and semi-supervised learning tasks.

6.2 Future Directions

Despite the recent progress in deep learning, one of the major challenges of machine learning, and of AI in general, in the next decade will be unsupervised learning. Here, we propose several future directions for improving unsupervised learning algorithms.

Implicit Distributions for Variational Inference. One of the most successful models for generative image modeling is the generative adversarial network (GAN) [11], which employs a two-player min-max game to model the data distribution. GANs can be considered within the wider framework of implicit generative models [42, 43, 44]. Implicit distributions can be sampled through their generative path, but their likelihood function is not tractable. Recently, several papers have proposed another application of GAN-style algorithms for approximate inference [42, 43, 44, 45, 46, 59, 47, 48, 49]. These algorithms use implicit distributions to learn posterior approximations that are more expressive than the distributions with tractable densities that are often used in variational inference. For example, in Chapter 4, we proposed the adversarial autoencoder, which uses a universal approximator posterior as the implicit posterior distribution and uses adversarial training to match the aggregated posterior of the latent code to the prior distribution. In Chapter 5, we proposed the PixelGAN autoencoder, a generative autoencoder in which the generative path is a convolutional autoregressive neural network on pixels (PixelCNN) that is conditioned on a latent code, and the recognition path uses an amortized GAN inference technique to learn implicit posterior distributions [46]. There have been some attempts to establish a mathematical connection between maximum-likelihood learning and the cost function of variational inference with implicit distributions, such as [47, 43]. Exploring this mathematical connection further and developing new variational inference techniques with implicit distributions are potentially important directions for improving unsupervised learning.
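As a small illustration of what "implicit" means here, the NumPy sketch below (with placeholder weights) draws samples by pushing Gaussian noise through a two-layer network: sampling is trivial, but the density of the resulting distribution has no closed form, which is why adversarial rather than likelihood-based training is used to match it to another distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def implicit_sampler(n, noise_dim=8, out_dim=2):
    """Sample from an implicit distribution by pushing noise through a network.

    We can draw samples (the generative path), but the density of the outputs is
    not available in closed form; only sample-based (e.g. adversarial) objectives apply.
    """
    w1 = rng.standard_normal((noise_dim, 64)) * 0.5     # placeholder weights
    w2 = rng.standard_normal((64, out_dim)) * 0.5
    eps = rng.standard_normal((n, noise_dim))
    return np.tanh(eps @ w1) @ w2

samples = implicit_sampler(1000)
print(samples.mean(axis=0), samples.std(axis=0))   # sample statistics are easy; log p(z) is not
```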

Improving Autoregressive Generative Models. Another successful framework for learning density models is autoregressive neural networks such as NADE [53], MADE [54], PixelRNN [55] and PixelCNN [56]. While these models achieve very high log-likelihood, they have two main drawbacks. First, autoregressive models such as the PixelCNN learn the image density directly at the pixel level, without learning a hierarchical latent representation. This is undesirable, as one of the main goals of generative modeling is to discover useful representations that can be used in downstream tasks such as semi-supervised learning. The second problem is that these models tend to use all their capacity to capture the low-level pixel statistics, and hence the generated images often have little recognizable global structure. In order to address both of these problems, we proposed the PixelGAN autoencoder in Chapter 5, which combines the benefits of latent variable models with autoregressive architectures. We showed that the latent variable in PixelGAN autoencoders captures the global structure of the image while the autoregressive decoder captures the local statistics of the image. We further showed that by using the latent representation of PixelGAN autoencoders, we can achieve state-of-the-art semi-supervised learning performance on several datasets. There have been other attempts at scaling PixelCNNs to large images by training PixelCNNs that condition on a low-resolution image and generate a higher-resolution image, such as [115, 116]; however, these networks are not trained end-to-end. Scaling autoregressive generative models to large-scale images and finding efficient ways to incorporate latent variables in these architectures are important future research directions.
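To make the first point concrete, the sketch below evaluates the autoregressive factorization log p(x) = Σ_i log p(x_i | x_<i) for a binary image in raster-scan order. The predict_pixel function is a toy stand-in for a PixelCNN-style conditional, not a real model; the point is only that the density is defined directly over pixels, with no latent representation involved.

```python
import numpy as np

rng = np.random.default_rng(0)

def autoregressive_log_likelihood(x, predict_pixel):
    """log p(x) = sum_i log p(x_i | x_{<i}) for a binary image x, in raster order.

    `predict_pixel(prefix)` stands in for a PixelCNN-style model: it returns the
    Bernoulli probability of the next pixel given all previously seen pixels.
    """
    flat = x.reshape(-1)
    log_p = 0.0
    for i in range(flat.size):
        p = predict_pixel(flat[:i])                      # p(x_i = 1 | x_{<i})
        log_p += np.log(p if flat[i] == 1 else 1.0 - p)
    return log_p

# A trivial stand-in model: predicts the running mean of the prefix (clipped away from 0/1).
toy_model = lambda prefix: np.clip(prefix.mean() if prefix.size else 0.5, 0.05, 0.95)
x = (rng.random((8, 8)) > 0.7).astype(np.int64)
print(autoregressive_log_likelihood(x, toy_model))
```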

Improving Generative Adversarial Networks. Generative adversarial networks (GANs) [11] are one of the most promising methods for generative image modeling. Compared to variational autoencoders, they can generate sharper images, and compared to autoregressive models, they are much faster to sample from. However, they also have some drawbacks. First, unlike variational autoencoders, they only learn a top-down generative model without learning an inference network, so it is not straightforward to use them for downstream tasks such as semi-supervised learning. Second, similar to autoregressive models, GANs capture local statistics very well, but images generated by unconditional GANs often lack global structure. Third, training a GAN requires finding a Nash equilibrium of a two-player min-max game. Gradient descent is not a good equilibrium-finding algorithm, and thus GAN training is unstable compared to VAE or PixelCNN training. In this thesis, we proposed adversarial autoencoders (Chapter 4) and PixelGAN autoencoders (Chapter 5), which learn efficient inference networks for GANs and obtain highly competitive semi-supervised classification results. Other attempts at learning inference networks for GANs include the ALI [48], BiGAN [49], Improved-GAN [103] and InfoGAN [97] models. Scaling GANs to large-scale images, making their training more stable, and designing efficient inference techniques for them are interesting future research directions.

Semi-supervised Learning. Semi-supervised learning is a long-standing and important problem in machine learning. In this thesis, we developed sparse autoencoders (Chapter 2 and Chapter 3) and generative autoencoders (Chapter 4 and Chapter 5) to learn features that are useful for semi-supervised classification tasks. However, semi-supervised learning on large-scale datasets such as ImageNet still remains an open problem. We believe the path to scaling semi-supervised learning to large-scale images is to develop generative models that can capture the statistics of large-scale images and allow efficient inference of the latent variables, especially the discrete variables. A good generative model should be able to disentangle the label information from the other underlying factors of variation in an unsupervised fashion. This factorization of information enables the posterior over the discrete latent variable to be used for semi-supervised learning. Scaling semi-supervised learning to large-scale images using generative models with discrete variables is an important avenue for future research.

Bibliography

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in NIPS, vol. 1, p. 4, 2012.

[2] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.

[3] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, "Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning," Nature Biotechnology, vol. 33, no. 8, pp. 831–838, 2015.

[4] H. Y. Xiong, B. Alipanahi, L. J. Lee, H. Bretschneider, D. Merico, R. K. Yuen, Y. Hua, S. Gueroussov, H. S. Najafabadi, T. R. Hughes, et al., “The human splicing code reveals new insights into the genetic determinants of disease,” Science, vol. 347, no. 6218, p. 1254806, 2015.

[5] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.

[6] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” ICML, 2013.

[7] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, “Pedestrian detection with unsupervised multi-stage feature learning,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 3626–3633, IEEE, 2013.

[8] R. Salakhutdinov and G. E. Hinton, “Deep boltzmann machines,” in International Conference on Artificial Intelligence and Statistics, pp. 448–455, 2009.


[9] V. Nair and G. E. Hinton, "3d object recognition with deep belief nets," in NIPS, pp. 1339–1347, 2009.

[10] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” International Conference on Learning Representations (ICLR), 2014.

[11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

[12] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler, “Skip-thought vectors,” in Advances in neural information processing systems, pp. 3294–3302, 2015.

[13] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[14] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” science, vol. 313, no. 5786, pp. 504–507, 2006.

[15] A. Ng, “Sparse autoencoder,” CS294A Lecture notes, vol. 72, 2011.

[16] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” The Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.

[17] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proceedings of the 25th international conference on Machine learning, pp. 1096–1103, ACM, 2008.

[18] G. Alain and Y. Bengio, "What regularized auto-encoders learn from the data-generating distribution," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 3563–3593, 2014.

[19] Y. Bengio, L. Yao, G. Alain, and P. Vincent, “Generalized denoising auto-encoders as generative models,” in Advances in Neural Information Processing Systems, pp. 899–907, 2013.

[20] S. Rifai, G. Mesnil, P. Vincent, X. Muller, Y. Bengio, Y. Dauphin, and X. Glorot, "Higher order contractive auto-encoder," in Machine Learning and Knowledge Discovery in Databases, pp. 645–660, Springer, 2011.

[21] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, “Semi-supervised learning with deep generative models,” in Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.

[22] L. Theis, W. Shi, A. Cunningham, and F. Huszár, “Lossy image compression with compressive autoencoders,” arXiv preprint arXiv:1703.00395, 2017.

[23] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell, “Full resolution image compression with recurrent neural networks,” arXiv preprint arXiv:1608.05148, 2016.

[24] K. Gregor, F. Besse, D. J. Rezende, I. Danihelka, and D. Wierstra, “Towards conceptual compression,” in Advances In Neural Information Processing Systems, pp. 3549–3557, 2016.

[25] R. Salakhutdinov and G. Hinton, “Semantic hashing,” International Journal of Approximate Reasoning, vol. 50, no. 7, pp. 969–978, 2009.

[26] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Advances in neural information processing systems, pp. 1753–1760, 2009.

[27] A. Krizhevsky and G. E. Hinton, "Using very deep autoencoders for content-based image retrieval," in ESANN, 2011.

[28] B. A. Olshausen and D. J. Field, “Sparse coding with an overcomplete basis set: A strategy employed by v1?,” Vision research, vol. 37, no. 23, pp. 3311–3325, 1997.

[29] V. Nair and G. E. Hinton, “3d object recognition with deep belief nets,” in Advances in Neural Information Processing Systems, pp. 1339–1347, 2009.

[30] J. A. Tropp and A. C. Gilbert, “Signal recovery from random measurements via orthogonal matching pursuit,” Information Theory, IEEE Transactions on, vol. 53, no. 12, pp. 4655–4666, 2007.

[31] K. Engan, S. O. Aase, and J. Hakon Husoy, “Method of optimal directions for frame design,” in Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on, vol. 5, pp. 2443–2446, IEEE, 1999.

[32] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006.

[33] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 399–406, 2010.

[34] K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “Fast inference in sparse coding algorithms with applications to object recognition,” arXiv preprint arXiv:1010.3467, 2010.

[35] G. E. Hinton, S. Osindero, and Y. W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, pp. 1527–1554, 2006.

[36] R. Salakhutdinov and G. E. Hinton, “Deep boltzmann machines,” in International Conference on Artificial Intelligence and Statistics, pp. 448–455, 2009.

[37] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropagation and approximate inference in deep generative models,” International Conference on Machine Learning, 2014.

[38] Y. Burda, R. Grosse, and R. Salakhutdinov, “Importance weighted autoencoders,” arXiv preprint arXiv:1509.00519, 2015.

[39] Z. Hu, Z. Yang, R. Salakhutdinov, and E. P. Xing, “On unifying deep generative models,” arXiv preprint arXiv:1706.00550, 2017.

[40] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal, "The 'wake-sleep' algorithm for unsupervised neural networks," Science, vol. 268, no. 5214, p. 1158, 1995.

[41] B. J. Frey, G. E. Hinton, and P. Dayan, “Does the wake-sleep algorithm produce good density estimators?,” in Advances in neural information processing systems, pp. 661–667, 1996.

[42] S. Mohamed and B. Lakshminarayanan, “Learning in implicit generative models,” arXiv preprint arXiv:1610.03483, 2016.

[43] F. Huszár, “Variational inference using implicit distributions,” arXiv preprint arXiv:1702.08235, 2017.

[44] D. Tran, R. Ranganath, and D. M. Blei, “Deep and hierarchical implicit models,” arXiv preprint arXiv:1702.08896, 2017.

[45] R. Ranganath, D. Tran, J. Altosaar, and D. Blei, "Operator variational inference," in Advances in Neural Information Processing Systems, pp. 496–504, 2016.

[46] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, "Adversarial autoencoders," International Conference on Learning Representations (ICLR) Workshop, 2016.

[47] L. Mescheder, S. Nowozin, and A. Geiger, "Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks," arXiv preprint arXiv:1701.04722, 2017.

[48] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville, “Adversarially learned inference,” arXiv preprint arXiv:1606.00704, 2016.

[49] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” arXiv preprint arXiv:1605.09782, 2016.

[50] Y. Li, K. Swersky, and R. Zemel, "Generative moment matching networks," International Conference on Machine Learning (ICML), 2015.

[51] D. J. Rezende and S. Mohamed, “Variational inference with normalizing flows,” arXiv preprint arXiv:1505.05770, 2015.

[52] D. P. Kingma, T. Salimans, and M. Welling, “Improving variational inference with inverse autoregressive flow,” arXiv preprint arXiv:1606.04934, 2016.

[53] H. Larochelle and I. Murray, "The neural autoregressive distribution estimator," in AISTATS, vol. 1, p. 2, 2011.

[54] M. Germain, K. Gregor, I. Murray, and H. Larochelle, "MADE: Masked autoencoder for distribution estimation," in ICML, pp. 881–889, 2015.

[55] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” arXiv preprint arXiv:1601.06759, 2016.

[56] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al., “Conditional image generation with pixelcnn decoders,” in Advances in Neural Information Processing Systems, pp. 4790–4798, 2016.

[57] A. Makhzani and B. Frey, “K-sparse autoencoders,” International Conference on Learning Representations (ICLR), 2014.

[58] A. Makhzani and B. J. Frey, "Winner-take-all autoencoders," in Advances in Neural Information Processing Systems, pp. 2791–2799, 2015.

[59] A. Makhzani and B. Frey, “Pixelgan autoencoders,” Advances in Neural Information Processing Systems, 2017.

[60] A. Coates, A. Y. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” in International Conference on Artificial Intelligence and Statistics, 2011.

[61] J. C. Van Gemert, J.-M. Geusebroek, C. J. Veenman, and A. W. Smeulders, “Kernel codebooks for scene categorization,” in Computer Vision–ECCV 2008, pp. 696–709, Springer, 2008.

[62] K. Swersky, I. Sutskever, D. Tarlow, R. S. Zemel, R. Salakhutdinov, and R. P. Adams, “Cardinality restricted boltzmann machines,” in Advances in Neural Information Processing Systems, pp. 3302–3310, 2012.

[63] T. Blumensath and M. E. Davies, “Iterative hard thresholding for compressed sensing,” Applied and Computational Harmonic Analysis, vol. 27, no. 3, pp. 265–274, 2009.

[64] A. Maleki, "Coherence analysis of iterative thresholding algorithms," in Communication, Control, and Computing, 2009. Allerton 2009. 47th Annual Allerton Conference on, pp. 236–243, IEEE, 2009.

[65] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization," Proceedings of the National Academy of Sciences, vol. 100, no. 5, pp. 2197–2202, 2003.

[66] Y. LeCun, F. J. Huang, and L. Bottou, “Learning methods for generic object recognition with invariance to pose and lighting,” in Computer Vision and Pattern Recognition, CVPR, vol. 2, pp. II–97, IEEE, 2004.

[67] A. Coates, A. Y. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” in International Conference on Artificial Intelligence and Statistics, pp. 215–223, 2011.

[68] T. Tieleman, “Gnumpy: an easy way to use gpu boards in python,” 2010.

[69] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, "Why does unsupervised pre-training help deep learning?," The Journal of Machine Learning Research, vol. 11, pp. 625–660, 2010.

[70] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” Advances in neural information processing systems, vol. 19, p. 153, 2007.

[71] K. Kavukcuoglu, P. Sermanet, Y.-L. Boureau, K. Gregor, M. Mathieu, and Y. LeCun, “Learning convolutional feature hierarchies for visual recognition.,” in NIPS, vol. 1, p. 5, 2010.

[72] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in Proceedings of the 26th Annual International Conference on Machine Learning, pp. 609–616, ACM, 2009.

[73] A. Krizhevsky, “Convolutional deep belief networks on cifar-10,” Unpublished, 2010.

[74] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, “Deconvolutional networks,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 2528–2535, IEEE, 2010.

[75] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPS workshop on deep learning and unsupervised feature learning, vol. 2011, p. 5, Granada, Spain, 2011.

[76] M. D. Zeiler and R. Fergus, “Differentiable pooling for hierarchical feature learning,” arXiv preprint arXiv:1207.0151, 2012.

[77] J. Bruna and S. Mallat, “Invariant scattering convolution networks,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 8, pp. 1872–1886, 2013.

[78] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid, “Convolutional kernel networks,” in Advances in Neural Information Processing Systems, pp. 2627–2635, 2014.

[79] M. Ranzato, F. J. Huang, Y.-L. Boureau, and Y. Lecun, “Unsupervised learning of invariant feature hierarchies with applications to object recognition,” in Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pp. 1–8, IEEE, 2007.

[80] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, "Semi-supervised learning with deep generative models," in Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.

[81] A. Coates and A. Y. Ng, "Selecting receptive fields in deep networks," in NIPS, 2011.

[82] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox, “Discriminative unsupervised feature learning with convolutional neural networks,” in Advances in Neural Information Processing Systems, pp. 766–774, 2014.

[83] T.-H. Lin and H. Kung, "Stable and efficient representation learning with nonnegativity constraints," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1323–1331, 2014.

[84] Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai, "Better mixing via deep representations," International Conference on Machine Learning (ICML), 2013.

[85] Y. Bengio, E. Thibodeau-Laufer, G. Alain, and J. Yosinski, “Deep generative stochastic networks trainable by backprop,” International Conference on Machine Learning (ICML), 2014.

[86] L. Theis, A. v. d. Oord, and M. Bethge, “A note on the evaluation of generative models,” arXiv preprint arXiv:1511.01844, 2015.

[87] L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther, “Auxiliary deep generative models,” arXiv preprint arXiv:1602.05473, 2016.

[88] T. Miyato, S.-i. Maeda, M. Koyama, K. Nakae, and S. Ishii, "Distributional smoothing with virtual adversarial training," stat, vol. 1050, p. 25, 2015.

[89] J. T. Springenberg, “Unsupervised and semi-supervised learning with categorical generative adversarial networks,” arXiv preprint arXiv:1511.06390, 2015.

[90] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko, “Semi-supervised learning with ladder networks,” in Advances in Neural Information Processing Systems, pp. 3532–3540, 2015.

[91] L. Van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.

[92] L. Maaten, “Learning a parametric embedding by preserving local structure,” in International Conference on Artificial Intelligence and Statistics, pp. 384–391, 2009.

[93] G. Hinton, "Non-linear dimensionality reduction." https://www.cs.toronto.edu/~hinton/csc2535/notes/lec11new.pdf.

[94] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.

[95] M. D. Hoffman and M. J. Johnson, “Elbo surgery: yet another way to carve up the variational evidence lower bound,” in NIPS 2016 Workshop on Advances in Approximate Bayesian Inference, 2016.

[96] D. Barber and F. V. Agakov, "The IM algorithm: A variational approach to information maximization," in NIPS, pp. 201–208, 2003.

[97] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," in Advances in Neural Information Processing Systems, pp. 2172–2180, 2016.

[98] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio, “Generating sentences from a continuous space,” arXiv preprint arXiv:1511.06349, 2015.

[99] X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel, “Variational lossy autoencoder,” arXiv preprint arXiv:1611.02731, 2016.

[100] F. Huszár, "Is maximum likelihood useful for representation learning?" http://www.inference.vc/maximum-likelihood-for-representation-learning-2.

[101] I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville, “Pixelvae: A latent variable model for natural images,” arXiv preprint arXiv:1611.05013, 2016.

[102] S. Laine and T. Aila, “Temporal ensembling for semi-supervised learning,” arXiv preprint arXiv:1610.02242, 2016.

[103] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Advances in Neural Information Processing Systems, pp. 2226–2234, 2016.

[104] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim, "Learning to discover cross-domain relations with generative adversarial networks," arXiv preprint arXiv:1703.05192, 2017.

[105] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” arXiv preprint arXiv:1703.10593, 2017.

[106] I. Sutskever, R. Jozefowicz, K. Gregor, D. Rezende, T. Lillicrap, and O. Vinyals, “Towards principled unsupervised learning,” arXiv preprint arXiv:1511.06440, 2015.

[107] A. V. M. Barone, “Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders,” arXiv preprint arXiv:1608.02996, 2016.

[108] M. Zhang, Y. Liu, H. Luan, and M. Sun, "Adversarial training for unsupervised bilingual lexicon induction," in ACL, 2017.

[109] D. J. Im, S. Ahn, R. Memisevic, and Y. Bengio, “Denoising criterion for variational auto-encoding framework,” arXiv preprint arXiv:1511.06406, 2015.

[110] C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár, “Amortised map inference for image super-resolution,” arXiv preprint arXiv:1610.04490, 2016.

[111] N. Kalchbrenner, A. v. d. Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu, “Video pixel networks,” arXiv preprint arXiv:1610.00527, 2016.

[112] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. Software available from tensorflow.org.

[113] T. Salimans, A. Karpathy, X. Chen, and D. P. Kingma, “Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications,” arXiv preprint arXiv:1701.05517, 2017.

[114] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[115] S. Reed, A. v. d. Oord, N. Kalchbrenner, S. G. Colmenarejo, Z. Wang, D. Belov, and N. de Freitas, "Parallel multiscale autoregressive density estimation," arXiv preprint arXiv:1703.03664, 2017.

[116] R. Dahl, M. Norouzi, and J. Shlens, “Pixel recursive super resolution,” arXiv preprint arXiv:1702.00783, 2017.