
Unsupervised Feature Learning and Deep Learning: A Review and New Perspectives

Yoshua Bengio, Aaron Courville, and Pascal Vincent
Department of computer science and operations research, U. Montreal


Abstract— The success of generally depends on Among the various ways of learning representations, this data representation, and we hypothesize that this is because differ- paper also focuses on those that can yield more non-linear, ent representations can entangle and hide more or less the different more abstract representations, i.e. deep learning.A deep explanatory factors of variation behind the data. Although domain architecture is formed by the composition of multiple levels of knowledge can be used to help design representations, learning can representation, where the number of levels is a free parameter also be used, and the quest for AI is motivating the design of more which can be selected depending on the demands of the given powerful representation-learning algorithms. This paper reviews recent work in the area of unsupervised feature learning and deep learning, task. This paper is meant to be a follow-up and a complement covering advances in probabilistic models, manifold learning, and deep to an earlier survey (Bengio, 2009) (but see also Arel et al. learning. This motivates longer-term unanswered questions about the (2010)). Here we survey recent progress in the area, with an appropriate objectives for learning good representations, for emphasis on the longer-term unanswered questions raised by representations (i.e., inference), and the geometrical connections be- this research, in particular about the appropriate objectives for tween representation learning, and manifold learning. learning good representations, for computing representations Index Terms—Deep learning, feature learning, , , RBM, auto-encoder, neural network (i.e., inference), and the geometrical connections between rep- resentation learning, density estimation and manifold learning. In Bengio and LeCun (2007), we introduce the notion of 1 INTRODUCTION AI-tasks, which are challenging for current machine learning Data representation is empirically found to be a core determi- algorithms, and involve complex but highly structured depen- nant of the performance of most machine learning algorithms. dencies. For substantial progress on tasks such as computer For that reason, much of the actual effort in deploying machine vision and natural language understanding, it seems hopeless learning algorithms goes into the design of , to rely only on simple parametric models (such as linear preprocessing and data transformations. models) because they cannot capture enough of the complexity is important but labor-intensive and highlights the weakness of interest. On the other hand, machine learning researchers of current learning algorithms, their inability to extract all have sought flexibility in local1 non-parametric learners such of the juice from the data. Feature engineering is a way to as kernel machines with a fixed generic local-response kernel take advantage of human intelligence and prior knowledge to (such as the Gaussian kernel). Unfortunately, as argued at compensate for that weakness. 
In order to expand the scope length previously (Bengio and Monperrus, 2005; Bengio et al., and ease of applicability of machine learning, it would be 2006a; Bengio and LeCun, 2007; Bengio, 2009; Bengio et al., highly desirable to make learning algorithms less dependent 2010), most of these algorithms only exploit the principle arXiv:1206.5538v1 [cs.LG] 24 Jun 2012 on feature engineering, so that novel applications could be of local generalization, i.e., the assumption that the target constructed faster, and more importantly, to make progress function (to be learned) is smooth enough, so they rely on towards Artificial Intelligence (AI). An AI must fundamentally examples to explicitly map out the wrinkles of the target understand the world around us, and this can be achieved if a function. Although smoothness can be a useful assumption, it learner can identify and disentangle the underlying explanatory is insufficient to deal with the curse of dimensionality, because factors hidden in the observed milieu of low-level sensory data. the number of such wrinkles (ups and downs of the target When it comes time to achieve state-of-the-art results on function) may grow exponentially with the number of relevant practical real-world problems, feature engineering can be interacting factors or input dimensions. What we advocate are combined with feature learning, and the simplest way is to learning algorithms that are flexible and non-parametric2 but learn higher-level features on top of handcrafted ones. This do not rely merely on the smoothness assumption. However, paper is about feature learning, or representation learning, i.e., it is useful to apply a linear model or kernel machine on top learning representations and transformations of the data that somehow make it easier to extract useful information out of it, 1. local in the sense that the value of the learned function at x depends e.g., when building classifiers or other predictors. In the case of mostly on training examples x(t)’s close to x probabilistic models, a good representation is often one that 2. We understand non-parametric as including all learning algorithms whose capacity can be increased appropriately as the amount of data and its captures the posterior distribution of underlying explanatory complexity demands it, e.g. including mixture models and neural networks factors for the observed input. where the number of parameters is a data-selected hyper-parameter. 2 of a learned representation: this is equivalent to learning the neuron model (such as a monotone non-linearity on top of an kernel, i.e., the feature space. Kernel machines are useful, but affine transformation), computation of a kernel, or logic gates. they depend on a prior definition of a suitable similarity metric, Theoretical results clearly show families of functions where a or a feature space in which naive similarity metrics suffice; we deep representation can be exponentially more efficient than would like to also use the data to discover good features. one that is insufficiently deep (Hastad,˚ 1986; Hastad˚ and This brings us to representation-learning as a core ele- Goldmann, 1991; Bengio et al., 2006a; Bengio and LeCun, ment that can be incorporated in many learning frameworks. 2007; Bengio and Delalleau, 2011). 
If the same family of Interesting representations are expressive, meaning that a functions can be represented with fewer parameters (or more reasonably-sized learned representation can capture a huge precisely with a smaller VC-dimension3), learning theory number of possible input configurations: that excludes one- would suggest that it can be learned with fewer examples, hot representations, such as the result of traditional cluster- yielding improvements in both computational efficiency and ing algorithms, but could include multi-clustering algorithms statistical efficiency. where either several clusterings take place in parallel or the Another important motivation for feature learning and deep same clustering is applied on different parts of the input, learning is that they can be done with unlabeled examples, such as in the very popular hierarchical feature extraction for so long as the factors relevant to the questions we will ask object recognition based on a histogram of cluster categories later (e.g. classes to be predicted) are somehow salient in detected in different patches of an image (Lazebnik et al., the input distribution itself. This is true under the manifold 2006; Coates and Ng, 2011a). Distributed representations and hypothesis, which states that natural classes and other high- sparse representations are the typical ways to achieve such level concepts in which humans are interested are associated expressiveness, and both can provide exponential gains over with low-dimensional regions in input space (manifolds) near more local approaches, as argued in section 3.2 (and Figure which the distribution concentrates, and that different class 3.2) of Bengio (2009). This is because each parameter (e.g. manifolds are well-separated by regions of very low density. the parameters of one of the units in a sparse code, or one As a consequence, feature learning and deep learning are of the units in a Restricted Boltzmann Machine) can be re- intimately related to principles of unsupervised learning, and used in many examples that are not simply near neighbors they can be exploited in the semi-supervised setting (where of each other, whereas with local generalization, different only a few examples are labeled), as well as the transfer regions in input space are basically associated with their own learning and multi-task settings (where we aim to generalize private set of parameters, e.g. as in decision trees, nearest- to new classes or tasks). The underlying hypothesis is that neighbors, Gaussian SVMs, etc. In a distributed representation, many of the underlying factors are shared across classes or an exponentially large number of possible subsets of features tasks. Since representation learning aims to extract and isolate or hidden units can be activated in response to a given input. these factors, representations can be shared across classes and In a single-layer model, each feature is typically associated tasks. with a preferred input direction, corresponding to a hyperplane In 2006, a breakthrough in feature learning and deep learn- in input space, and the code or representation associated ing took place (Hinton et al., 2006; Bengio et al., 2007; with that input is precisely the pattern of activation (which Ranzato et al., 2007), which has been extensively reviewed features respond to the input, and how much). This is in and discussed in Bengio (2009). 
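The capacity argument above can be made concrete with a small illustration. The following Python/numpy sketch (a toy added here for illustration only, with arbitrary dimensions and thresholded random projections standing in for learned features) contrasts a one-hot clustering code, which can distinguish at most K regions of input space, with a distributed binary code of K features, which can in principle distinguish up to 2^K regions with a comparable number of parameters.

    # Toy contrast between one-hot (clustering) and distributed binary codes.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((5000, 10))          # toy 10-dimensional inputs

    # One-hot code: assignment to the nearest of K centroids -> at most K codes.
    K = 16
    centroids = rng.standard_normal((K, 10))
    dist2 = ((X[:, None, :] - centroids[None]) ** 2).sum(axis=-1)
    onehot = np.argmin(dist2, axis=1)
    print("distinct one-hot codes:", len(np.unique(onehot)))             # <= 16

    # Distributed code: K binary features from thresholded linear projections.
    R = rng.standard_normal((10, K))
    codes = (X @ R > 0).astype(int)
    print("distinct distributed codes:", len(np.unique(codes, axis=0)))  # up to 2**16

Both encoders use on the order of 10 x K parameters, but the distributed code carves the input space into exponentially more identifiable regions, which is exactly the re-use property discussed above.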
A central idea, referred to contrast with a non-distributed representation such as the as greedy layerwise unsupervised pre-training, was to learn a one learned by most clustering algorithms, e.g., k-, hierarchy of features one level at a time, using unsupervised in which the representation of a given input vector is a feature learning to learn a new transformation at each level one-hot code (identifying which one of a small number of to be composed with the previously learned transformations; cluster centroids best represents the input). The situation seems essentially, each iteration of unsupervised feature learning adds slightly better with a decision tree, where each given input is one layer of weights to a deep neural network. Finally, the set associated with a one-hot code over the tree leaves, which of layers could be combined to initialize a deep supervised pre- deterministically selects associated ancestors (the path from dictor, such as a neural network classifier, or a deep generative root to node). Unfortunately, the number of different regions model, such as a Deep Boltzmann Machine (Salakhutdinov represented (equal to the number of leaves of the tree) still and Hinton, 2009a). only grows linearly with the number of parameters used to This paper is about feature learning algorithms that can specify it (Bengio and Delalleau, 2011). be stacked for that purpose, as it was empirically observed The notion of re-use, which explains the power of dis- that this layerwise stacking of feature extraction often yielded tributed representations, is also at the heart of the theoretical better representations, e.g., in terms of classification er- advantages behind deep learning, i.e., constructing multiple ror (Larochelle et al., 2009b; Erhan et al., 2010b), quality of levels of representation or learning a hierarchy of features. the samples generated by a probabilistic model (Salakhutdinov The depth of a circuit is the length of the longest path and Hinton, 2009a) or in terms of the invariance properties of from an input node of the circuit to an output node of the the learned features (Goodfellow et al., 2009). circuit. Formally, one can change the depth of a given circuit Among feature extraction algorithms, Principal Components by changing the definition of what each node can compute, but only by a constant factor. The typical computations we 3. Note that in our experiments, deep architectures tend to generalize very allow in each node include: weighted sum, product, artificial well even when they have quite large numbers of parameters. 3
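The stacking procedure just described can be summarized by a short sketch. In the Python/numpy code below (an illustration only; PCA with a tanh non-linearity stands in for a generic one-layer unsupervised feature learner, whereas in practice an RBM or auto-encoder of the kind reviewed in Sections 2 and 3 would be used), each layer is trained on the representation produced by the previously trained layers, and the top-level representation can then initialize or feed a supervised predictor.

    # Greedy layer-wise unsupervised pre-training (generic sketch).
    import numpy as np

    def fit_layer(X, n_components):
        """Train one unsupervised layer (here: PCA as a stand-in learner)."""
        mu = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
        return mu, Vt[:n_components].T                 # (d_in, n_components)

    def transform(layer, X):
        mu, W = layer
        return np.tanh((X - mu) @ W)                   # non-linearity between layers

    def pretrain_stack(X, layer_sizes):
        layers, H = [], X
        for size in layer_sizes:
            layer = fit_layer(H, size)                 # train on previous layer's output
            layers.append(layer)
            H = transform(layer, H)
        return layers, H                               # H can feed a supervised predictor

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 50))
    layers, top_features = pretrain_stack(X, [32, 16, 8])
    print(top_features.shape)                          # (1000, 8)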

Analysis or PCA (Pearson, 1901; Hotelling, 1933) is probably This paper also raises many questions, discussed but the oldest and most widely used. It learns a linear transfor- certainly not completely answered here. What is a good mation h = f(x) = W T x + b of input x ∈ Rdx , where representation? What are good criteria for learning such the columns of dx × dh matrix W form an orthogonal basis representations? How can we evaluate the quality of a for the dh orthogonal directions of greatest variance in the representation-learning ? Are there probabilistic in- training data. The result is dh features (the components of terpretations of non-probabilistic feature learning algorithms representation h) that are decorrelated. Interestingly, PCA such as auto-encoder variants and predictive sparse decom- may be reinterpreted from the three different viewpoints from position (Kavukcuoglu et al., 2008), and could we sample which recent advances in non-linear feature learning tech- from the corresponding models? What are the advantages niques arose: a) it is related to probabilistic models (Section 2) and disadvantages of the probabilistic vs non-probabilistic such as probabilistic PCA, and the traditional feature learning algorithms? Should learned representations multivariate Gaussian distribution (the leading eigenvectors be necessarily low-dimensional (as in Principal Components of the are the principal components); b) Analysis)? Should we map inputs to representations in a the representation it learns is essentially the same as that way that takes into account the explaining away effect of learned by a basic linear auto-encoder (Section 3); and c) different explanatory factors, at the price of more expensive it can be viewed as a simple linear form of manifold learning computation? Is the added power of stacking representations (Section 5), i.e., characterizing a lower-dimensional region in into a deep architecture worth the extra effort? What are input space near which the data density is peaked. Thus, PCA the reasons why globally optimizing a deep architecture has may be kept in the back of the reader’s mind as a common been found difficult? What are the reasons behind the success thread relating these various viewpoints. Unfortunately the of some of the methods that have been proposed to learn expressive power of linear features is very limited: they cannot representations, and in particular deep ones? be stacked to form deeper, more abstract representations since the composition of linear operations yields another linear 2 PROBABILISTIC MODELS operation. Here, we focus on recent algorithms that have been developed to extract non-linear features, which can From the probabilistic modeling perspective, the question of be stacked in the construction of deep networks, although feature learning can be interpreted as an attempt to recover some authors simply insert a non-linearity between learned a parsimonious set of latent random variables that describe single-layer linear projections (Le et al., 2011c; Chen et al., a distribution over the observed data. We can express any 2012). Another rich family of feature extraction techniques, probabilistic model over the joint space of the latent variables, that this review does not cover in any detail due to space h, and observed or visible variables x, (associated with the constraints is Independent Component Analysis or ICA (Jutten data) as p(x, h). Feature values are conceived as the result of and Herault, 1991; Comon, 1994; Bell and Sejnowski, 1997). 
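To make the PCA mapping concrete, the following Python/numpy sketch (an illustration; names and sizes are arbitrary) computes h = f(x) = W^T x + b, with the columns of W taken as the leading eigenvectors of the training covariance and b = -W^T mu so that the components are centered; the extracted features are decorrelated, as stated above.

    # PCA as a linear feature extractor h = W^T x + b.
    import numpy as np

    def fit_pca(X, d_h):
        """X: (n_examples, d_x). Returns W (d_x, d_h) and offset b (d_h,)."""
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        eigval, eigvec = np.linalg.eigh(cov)            # eigenvalues in ascending order
        W = eigvec[:, ::-1][:, :d_h]                    # top-d_h directions of variance
        b = -W.T @ mu                                   # centers the features
        return W, b

    def pca_features(X, W, b):
        return X @ W + b                                # each row is h = W^T x + b

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 20)) @ rng.standard_normal((20, 20))
    W, b = fit_pca(X, d_h=5)
    H = pca_features(X, W, b)
    print(np.round(np.cov(H, rowvar=False), 2))         # approximately diagonal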
an inference process to determine the probability distribution Instead, we refer the reader to Hyvarinen¨ et al. (2001a); of the latent variables given the data, i.e. p(h | x), often Hyvarinen¨ et al. (2009). Note that, while in the simplest referred to as the posterior probability. Learning is conceived case (complete, noise-free) ICA yields linear features, in the in term of estimating a set of model parameters that (locally) more general case it can be equated with a linear generative maximizes the likelihood of the training data with respect to model with non-Gaussian independent latent variables, similar the distribution over these latent variables. The probabilistic to sparse coding (section 2.1.3), which will yield non-linear formalism gives us two possible modeling features. Therefore, ICA and its variants like Independent Sub- paradigms in which we can consider the question of inferring space Analysis (Hyvarinen¨ and Hoyer, 2000) and Topographic the latent variables: directed and undirected graphical models. ICA (Hyvarinen¨ et al., 2001b) can and have been used to build The key distinguishing factor between these paradigms is deep networks (Le et al., 2010, 2011c): see section 8.2. The the nature of their parametrization of the joint distribution notion of obtaining independent components also appears sim- p(x, h). The choice of directed versus undirected model has ilar to our stated goal of disentangling underlying explanatory a major impact on the nature and computational costs of the factors through deep networks. However, for complex real- algorithmic approach to both inference and learning. world distributions, it is doubtful that the relationship between truly independent underlying factors and the observed high- dimensional data can be adequately characterized by a linear 2.1 Directed Graphical Models transformation. Directed latent factor models are parametrized through a de- A novel contribution of this paper is that it proposes composition of the joint distribution, p(x, h) = p(x | h)p(h), a new probabilistic framework to encompass both tradi- involving a prior p(h), and a likelihood p(x | h) that tional, likelihood-based probabilistic models (Section 2) and describes the observed data x in terms of the latent factors reconstruction-based models such as auto-encoder variants h. Unsupervised feature learning models that can be inter- (Section 3). We call this new framework JEPADA, for Joint preted with this decomposition include: Principal Components Energy in PArameters and DAta: the basic idea is to consider Analysis (PCA) (Roweis, 1997; Tipping and Bishop, 1999), the training criterion for reconstruction-based models as an sparse coding (Olshausen and Field, 1996), sigmoid belief energy function for a joint undirected model linking data and networks (Neal, 1992) and the newly introduced spike-and- parameters, with a partition function that marginalizes both. slab sparse coding model (Goodfellow et al., 2011). 4
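The directed decomposition p(x, h) = p(x | h) p(h) can be illustrated with a few lines of ancestral sampling. The Python/numpy sketch below (illustrative only; the isotropic Gaussian prior and linear-Gaussian likelihood essentially correspond to the probabilistic PCA model of Section 2.1.2) samples the latent factors first and the observations second; feature extraction corresponds to inverting this process by inferring the posterior p(h | x).

    # Ancestral sampling from a directed latent factor model p(x, h) = p(x | h) p(h).
    import numpy as np

    rng = np.random.default_rng(0)
    d_h, d_x = 4, 10
    W = rng.standard_normal((d_x, d_h))          # generative weights
    mu_x = np.zeros(d_x)
    sigma_h, sigma_x = 1.0, 0.1

    def sample_joint(n):
        h = sigma_h * rng.standard_normal((n, d_h))                       # h ~ p(h)
        x = h @ W.T + mu_x + sigma_x * rng.standard_normal((n, d_x))      # x ~ p(x | h)
        return x, h

    X, H = sample_joint(1000)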

2.1.1 Explaining Away 2.1.3 Sparse Coding In the context of latent factor models, the form of the directed As in the case of PCA, sparse coding has both a probabilistic graphical model often leads to one important property, namely and non-probabilistic interpretation. Sparse coding also relates explaining away: a priori independent causes of an event can a latent representation h (either a vector of random variables become non-independent given the observation of the event. or a feature vector, depending on the interpretation) to the Latent factor models can generally be interpreted as latent data x through a linear mapping W , which we refer to as the cause models, where the h activations cause the observed x. dictionary. The difference between sparse coding and PCA This renders the a priori independent h to be non-independent. is that sparse coding includes a penalty to ensure a sparse As a consequence recovering the posterior distribution of h: activation of h is used to encode each input x. p(h | x) (which we use as a basis for feature representation) Specifically, from a non-probabilistic perspective, sparse is often computationally challenging and can be entirely coding can be seen as recovering the code or feature vector intractable, especially when h is discrete. associated to a new input x via: ∗ 2 h = f(x) = argmin kx − W hk2 + λkhk1, (2) A classic example that illustrates the phenomenon is to h imagine you are on vacation away from home and you receive Learning the dictionary W can be accomplished by optimizing a phone call from the company that installed the security the following training criterion with respect to W : system at your house. They tell you that the alarm has been X (t) ∗(t) 2 JSC = kx − W h k2, (3) activated. You begin worry your home has been burglarized, t but then you hear on the radio that a minor earthquake has where the x(t) is the input vector for example t and h∗(t) are been reported in the area of your home. If you happen to the corresponding sparse codes determined by Eq. 2. W is know from prior experience that earthquakes sometimes cause usually constrained to have unit-norm columns (because one your home alarm system to activate, then suddenly you relax, can arbitrarily exchange scaling of column i with scaling of confident that your home has very likely not been burglarized. (t) hi , such a constraint is necessary for the L1 penalty to have The example illustrates how the observation, alarm acti- any effect). vation, rendered two otherwise entirely independent causes, The probabilistic interpretation of sparse coding differs from burglarized and earthquake, to become dependent – in this that of PCA, in that instead of a Gaussian prior on the latent case, the dependency is one of mutual exclusivity. Since both random variable h, we use a sparsity inducing Laplace prior burglarized and earthquake are very rare events and both can (corresponding to an L1 penalty): cause alarm activation, the observation of one explains away d Yh the other. The example demonstrates not only how observa- p(h) = λ exp(−λ|hi|) tions can render causes to be statistically dependent, but also i 2 the utility of explaining away. It gives rise to a parsimonious p(x | h) = N (W h + µx, σxI). (4) prediction of the unseen or latent events from the observations. Returning to latent factor models, despite the computational In the case of sparse coding, because we will ultimately obstacles we face when attempting to recover the posterior be interested in a sparse representation (i.e. 
one with many features set to exactly zero), we will be interested in recovering over h, explaining away promises to provide a parsimonious ∗ p(h | x) which can be extremely useful characteristic of a the MAP (maximum a posteriori value of h: i.e. h = feature encoding scheme.If one thinks of a representation as argmaxh p(h | x) rather than its expected value Ep(h|x) [h]. being composed of various feature detectors and estimated Under this interpretation, dictionary learning proceeds as max- attributes of the observed input, it is useful to allow the imizing the likelihood of the data given these MAP values of h∗ argmax Q p(x(t) | h∗(t)) different features to compete and collaborate with each other : W t subject to the norm constraint to explain the input. This is naturally achieved with directed on W . Note that this parameter learning scheme, subject to graphical models, but can also be achieved with undirected the MAP values of the latent h, is not standard practice models (see Section 2.2) such as Boltzmann machines if there in the probabilistic graphical model literature. Typically the p(x) = P p(x | h)p(h) are lateral connections between the corresponding units or likelihood of the data h is maxi- corresponding interaction terms in the energy function that mized directly. In the presence of latent variables, expectation defines the probability model. maximization (Dempster et al., 1977) is employed where the parameters are optimized with respect to the marginal 2.1.2 Probabilistic Interpretation of PCA likelihood, i.e., summing or integrating the joint log-likelihood over the values of the latent variables under their posterior While PCA was not originally cast as probabilistic model, it P (h | x), rather than considering only the MAP values of h. possesses a natural probabilistic interpretation (Roweis, 1997; The theoretical properties of this form of parameter learning Tipping and Bishop, 1999) that casts PCA as factor analysis: are not yet well understood but seem to work well in practice 2 (e.g. k-Means vs Gaussian mixture models and Viterbi training p(h) = N (0, σhI) 2 for HMMs). Note also that the interpretation of sparse coding p(x | h) = N (W h + µx, σ I), (1) x as a MAP estimation can be questioned (Gribonval, 2011), where x ∈ Rdx , h ∈ Rdh and columns of W span the because even though the interpretation of the L1 penalty as same space as leading dh principal components, but are not a log-prior is a possible interpretation, there can be other constrained to be orthonormal. Bayesian interpretations compatible with the training criterion. 5
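The two levels of optimization in Eqs. (2) and (3) can be sketched as follows. In the Python/numpy code below (illustrative; the paper does not prescribe a particular solver, and ISTA, a proximal gradient method, is used here only as one simple way to solve the L1-regularized inference problem), each example is encoded by iterative shrinkage-thresholding and the dictionary is then updated by a gradient step with its columns renormalized to unit norm.

    # Sparse coding: ISTA inference for Eq. (2), dictionary step for Eq. (3).
    import numpy as np

    def ista(x, W, lam, n_steps=100):
        """Approximately solve h* = argmin_h ||x - W h||^2 + lam ||h||_1."""
        L = 2.0 * np.linalg.norm(W, 2) ** 2             # Lipschitz constant of the quadratic part
        h = np.zeros(W.shape[1])
        for _ in range(n_steps):
            grad = 2.0 * W.T @ (W @ h - x)              # gradient of ||x - W h||^2
            z = h - grad / L
            h = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding
        return h

    def dictionary_step(X, W, lam, lr=0.01):
        """Codes held fixed at h*(t); one gradient step on W, then renormalize columns."""
        H = np.stack([ista(x, W, lam) for x in X])
        grad_W = -2.0 * (X - H @ W.T).T @ H / len(X)    # gradient of J_SC w.r.t. W
        W = W - lr * grad_W
        return W / np.linalg.norm(W, axis=0, keepdims=True), H

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 16))
    W = rng.standard_normal((16, 32))                   # overcomplete dictionary
    W /= np.linalg.norm(W, axis=0, keepdims=True)
    for _ in range(5):
        W, H = dictionary_step(X, W, lam=0.5)
    print("fraction of exactly-zero code entries:", (H == 0).mean())

Note that the code H is recomputed from scratch for every example: as discussed above, the encoder is implicit and non-parametric, and inference costs an inner optimization loop.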

Sparse coding is an excellent example of the power of explaining away. The Laplace distribution (equivalently, the p(hi = 1) = sigmoid(bi), L1 penalty) over the latent h acts to resolve a sparse and −1 parsimonious representation of the input. Even with a very p(si | hi) = N (si | hiµi, αii ), −1 overcomplete dictionary with many redundant bases, the MAP p(xj | s, h) = N (xj | Wj:(h ◦ s), βjj ) (5) inference process used in sparse coding to find h∗ can pick where sigmoid is the logistic , b is a set out the most appropriate bases and zero the others, despite of biases on the spike variables, µ and W govern the linear them having a high degree of correlation with the input. This dependence of s on h and v on s respectively, α and β are property arises naturally in directed graphical models such as diagonal precision matrices of their respective conditionals and sparse coding and is entirely owing to the explaining away h ◦ s denotes the element-wise product of h and s. The state effect. It is not seen in commonly used undirected probabilistic of a hidden unit is best understood as h s , that is, the spike models such as the RBM, nor is it seen in parametric feature i i variables gate the slab variables. encoding methods such as auto-encoders. The trade-off is The basic form of the S3C model (i.e. a spike-and-slab latent that, compared to methods such as RBMs and auto-encoders, factor model) has appeared a number of times in different inference in sparse coding involves an extra inner-loop of domains (Lucke¨ and Sheikh, 2011; Garrigues and Olshausen, optimization to find h∗ with a corresponding increase in the 2008; Mohamed et al., 2011; Titsias and Lazaro-Gredilla,´ computational cost of feature extraction. Compared to auto- 2011). However, existing inference schemes have at most been encoders and RBMs, the code in sparse coding is a free applied to models with hundreds of bases and hundreds of variable for each example, and in that sense the implicit thousands of examples. Goodfellow et al. (2012) have recently encoder is non-parametric. introduced an approximate variational inference scheme that One might expect that the parsimony of the sparse cod- scales to the sizes of models and datasets we typically consider ing representation and its explaining away effect would be in unsupervised feature learning, i.e. thousands of bases on advantageous and indeed it seems to be the case. Coates millions of examples. and Ng (2011a) demonstrated with the CIFAR-10 object S3C has been applied to the CIFAR-10 and CIFAR-100 classification task (Krizhevsky and Hinton, 2009) with a patch- object classification tasks (Krizhevsky and Hinton, 2009), base feature extraction pipeline, that in the regime with few and shows the same pattern as sparse coding of superior (< 1000) labeled training examples per class, the sparse performance in the regime of relatively few (< 1000) labeled coding representation significantly outperformed other highly examples per class (Goodfellow et al., 2012). In fact, in both competitive encoding schemes. Possibly because of these the CIFAR-100 dataset (with 500 examples per class) and the properties, and because of the very computationally efficient CIFAR-10 dataset (when the number of examples is reduced to algorithms that have been proposed for it (in comparison with a similar range), the S3C representation actually outperforms the general case of inference in the presence of explaining sparse coding representations. 
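The generative process of Eq. (5) is simple to write down explicitly. The following Python/numpy sketch (illustrative; the dimensions and parameter values are arbitrary and the diagonal precisions are represented as vectors) draws spikes, then slabs, then visible units by ancestral sampling.

    # Ancestral sampling from the spike-and-slab (S3C) model of Eq. (5).
    import numpy as np

    rng = np.random.default_rng(0)
    d_h, d_x = 8, 16
    b = -2.0 * np.ones(d_h)                      # spike biases (favor sparse h)
    mu = rng.standard_normal(d_h)                # slab means
    alpha = np.ones(d_h)                         # diagonal slab precisions
    beta = 10.0 * np.ones(d_x)                   # diagonal visible precisions
    W = rng.standard_normal((d_x, d_h))          # dictionary

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def sample_s3c(n):
        h = (rng.random((n, d_h)) < sigmoid(b)).astype(float)               # spikes
        s = h * mu + rng.standard_normal((n, d_h)) / np.sqrt(alpha)         # slabs ~ N(h_i mu_i, 1/alpha_i)
        x = (h * s) @ W.T + rng.standard_normal((n, d_x)) / np.sqrt(beta)   # x ~ N(W (h o s), beta^{-1})
        return x, h, s

    X, H, S = sample_s3c(1000)
    print("average number of active spikes per example:", H.sum(axis=1).mean())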
away), sparse coding enjoys considerable popularity as a feature learning and encoding paradigm. There are numerous 2.2 Undirected Graphical Models examples of its successful application as a feature repre- sentation scheme, including natural image modeling (Raina Undirected graphical models, also called Markov random et al., 2007; Kavukcuoglu et al., 2008; Coates and Ng, 2011a; fields, parametrize the joint p(x, h) through a factorization in Yu et al., 2011), audio classification (Grosse et al., 2007), terms of unnormalized clique potentials: 1 Y Y Y natural language processing (Bagnell and Bradley, 2009), as p(x, h) = ψi(x) ηj (h) νk(x, h) (6) Zθ well as being a very successful model of the early visual i j k cortex (Olshausen and Field, 1997). Sparsity criteria can also where ψi(x), ηj(h) and νk(x, h) are the clique potentials de- be generalized successfully to yield groups of features that scribing the interactions between the visible elements, between prefer to all be zero, but if one or a few of them are active then the hidden variables, and those interaction between the visible the penalty for activating others in the group is small. Different and hidden variables respectively. The partition function Zθ group sparsity patterns can incorporate different forms of prior ensures that the distribution is normalized. Within the context knowledge (Kavukcuoglu et al., 2009; Jenatton et al., 2009; of unsupervised feature learning, we generally see a particular Bach et al., 2011; Gregor et al., 2011a). form of Markov random field called a Boltzmann distribution with clique potentials constrained to be positive: 1 p(x, h) = exp (−Eθ(x, h)) , (7) 2.1.4 Spike-and-Slab Sparse Coding Zθ where Eθ(x, h) is the energy function and contains the inter- Spike-and-slab sparse coding (S3C) is a promising example of actions described by the MRF clique potentials and θ are the a directed graphical model for feature learning (Goodfellow model parameters that characterize these interactions. et al., 2012). The S3C model possesses a set of latent binary A Boltzmann machine is defined as a network of spike variables h ∈ {0, 1}dh , a set of latent real-valued slab symmetrically-coupled binary random variables or units. d variables s ∈ R h , and real-valued dx-dimensional visible vec- These stochastic units can be divided into two groups: (1) the tor x ∈ Rdx . Their interaction is specified via the factorization visible units x ∈ {0, 1}dx that represent the data, and (2) the d of the joint p(x, s, h): ∀i ∈ {1, . . . , dh}, j ∈ {1, . . . , dx}, hidden or latent units h ∈ {0, 1} h that mediate dependencies 6 between the visible units through their mutual interactions. The pattern of interaction is specified through the energy function: Y P (x | h) = P (xj | h) 1 1 E BM(x, h) = − xT Ux − hT V h − xT W h − bT x − dT h, (8) j θ 2 2 ! X where θ = {U, V, W, b, d} are the model parameters P (xj = 1 | h) = sigmoid Wjihi + bj (14) which respectively encode the visible-to-visible interac- i tions, the hidden-to-hidden interactions, the visible-to-hidden This conditional factorization property of the RBM immedi- interactions, the visible self-connections, and the hidden ately implies that most inferences we would like make are self-connections (also known as biases). To avoid over- readily tractable. For example, the RBM feature representation parametrization, the diagonals of U and V are set to zero. 
is taken to be the set of posterior marginals P (hi | x) which The Boltzmann machine energy function specifies the prob- are, given the conditional independence described in Eq. 13, ability distribution over the joint space [x, h], via the Boltz- are immediately available. Note that this is in stark contrast to the situation with popular directed graphical models for mann distribution, Eq. 7, with the partition function Zθ given by: unsupervised feature extraction, where computing the posterior probability is intractable. h =1 x1=1 xdx =1 h1=1 dh X X X X  BM  Importantly, the tractability of the RBM does not extend to Zθ = ··· ··· exp −Eθ (x, h; θ) . (9) its partition function, which still involves sums with exponen- x =0 x =0 h =0 h =0 1 dx 1 dh tial number of terms. It does imply however that we can limit This joint probability distribution gives rise to the set of the number of terms to min{2dx , 2dh }. Usually this is still an conditional distributions of the form: unmanageable number of terms and therefore we must resort   to approximate methods to deal with its estimation. X X It is difficult to overstate the impact the RBM has had to P (hi | x, h\i) = sigmoid  Wjixj + Vii0 hi0 + di (10) j i06=i the fields of unsupervised feature learning and deep learning.   It has been used in a truly impressive variety of applica- X X et al. P (xj | h, x\j ) = sigmoid  Wjixj + Ujj0 xj0 + bj  . tions, including fMRI image classification (Schmah , i j06=j 2009), motion and spatial transformations (Taylor and Hinton, (11) 2009; Memisevic and Hinton, 2010), collaborative filtering (Salakhutdinov et al., 2007) and natural image modeling In general, inference in the Boltzmann machine is intractable. (Ranzato and Hinton, 2010; Courville et al., 2011b). In the For example, computing the conditional probability of h given i next section we review the most popular methods for training the visibles, P (h | x), requires marginalizing over the rest of i RBMs. the hiddens which implies evaluating a sum with 2dh−1 terms:

h =1 hi−1=1 hi+1=1 hd =1 X1 X X Xh 2.3 RBM parameter estimation P (hi | x) = ··· ··· P (h | x) (12) In this section we discuss several algorithms for training h =0 h =0 h =0 h =0 1 i−1 i+1 dh the restricted Boltzmann machine. Many of the methods we However with some judicious choices in the pattern of inter- discuss are applicable to more general undirected graphical actions between the visible and hidden units, more tractable models, but are particularly practical in the RBM setting. subsets of the model family are possible, as we discuss next. As discussed in Sec. 2.1, in training probabilistic models parameters are typically adapted in order to maximize the like- 2.2.1 Restricted Boltzmann Machines lihood of the training data (or equivalently the log-likelihood, The restricted Boltzmann machine (RBM) is likely the most or its penalized version which adds a regularization term). popular subclass of Boltzmann machine. It is defined by With T training examples, the log likelihood is given by: T T restricting the interactions in the Boltzmann energy function, X (t) X X (t) RBM BM log P (x ; θ) = log P (x , h; θ). (15) in Eq. 8, to only those between h and x, i.e. Eθ is Eθ with U = 0 and V = 0. As such, the RBM can be said to form a t=1 t=1 h∈{0,1}dh bipartite graph with the visibles and the hiddens forming two One straightforward way we can consider maximizing this layers of vertices in the graph (and no connection between quantity is to take small steps uphill, following the log- units of the same layer). With this restriction, the RBM likelihood gradient, to find a local maximum of the likelihood. possesses the useful property that the conditional distribution For any Boltzmann machine, the gradient of the log-likelihood over the hidden units factorizes given the visibles: of the data is given by: T T Y ∂  ∂  P (h | x) = P (h | x) X (t) X RBM (t) i log p(x ) = − Ep(h|x(t)) Eθ (x , h) ∂θi ∂θi i t=1 t=1 ! X T   P (h = 1 | x) = sigmoid W x + d X ∂ RBM i ji j i (13) + Ep(x,h) Eθ (x, h) , (16) ∂θi j t=1 Likewise, the conditional distribution over the visible units where we have the expectations with respect to p(h(t) | x(t)) given the hiddens also factorizes: in the “clamped” condition (also called the positive phase), 7 and over the full joint p(x, h) in the “unclamped” condition distribution represented by the positive phase training data. (also called the negative phase). Intuitively, the gradient acts Consequently, they may combine to produce a good estimate to locally move the model distribution (the negative phase of the gradient, or direction of progress. Much has been written distribution) toward the data distribution (positive phase dis- about the properties and alternative interpretations of CD, e.g. tribution), by pushing down the energy of (h, x(t)) pairs (for Carreira-Perpinan˜ and Hinton (2005); Yuille (2005); Bengio h ∼ P (h|x(t))) while pushing up the energy of (h, x) pairs and Delalleau (2009); Sutskever and Tieleman (2010). (for (h, x) ∼ P (h, x)) until the two forces are in equilibrium. The RBM conditional independence properties imply that 2.3.2 Stochastic Maximum Likelihood: the expectation in the positive phase of Eq. 16 is readily The stochastic maximum likelihood (SML) algorithm (also tractable. 
The negative phase term – arising from the partition known as persistent contrastive divergence or PCD) (Younes, function’s contribution to the log-likelihood gradient – is more 1999; Tieleman, 2008) is an alternative way to sidestep an problematic because the computation of the expectation over extended burn-in of the negative phase Gibbs sampler. At each the joint is not tractable. The various ways of dealing with the gradient update, rather than initializing the Gibbs chain at the partition function’s contribution to the gradient have brought positive phase sample as in CD, SML initializes the chain at about a number of different training algorithms, many trying the last state of the chain used for the previous update. In to approximate the log-likelihood gradient. other words, SML uses a continually running Gibbs chain (or To approximate the expectation of the joint distribution in often a number of Gibbs chains run in parallel) from which the negative phase contribution to the gradient, it is natural to samples are drawn to estimate the negative phase expectation. again consider exploiting the conditional independence of the Despite the model parameters changing between updates, these RBM in order to specify a Monte Carlo approximation of the changes should be small enough that only a few steps of Gibbs expectation over the joint: (in practice, often one step is used) are required to maintain   L samples from the equilibrium distribution of the Gibbs chain, ∂ RBM 1 X ∂ RBM (l) ˜(l) Ep(x,h) Eθ (x, h) ≈ Eθ (˜x , h ), (17) i.e. the model distribution. ∂θi L ∂θi l=1 One aspect of SML that has received considerable recent with the samples drawn by a block Gibbs MCMC (Markov attention is that it relies on the Gibbs chain to have reasonably chain Monte Carlo) sampling scheme from the model distri- good mixing properties for learning to succeed. Typically, as bution: learning progresses and the weights of the RBM grow, the x˜(l) ∼ P (x | h˜(l−1)) ergodicity of the Gibbs sample begins to break down. If the h˜(l) ∼ P (h | x˜(l)).  associated with gradient ascent θ ← θ + gˆ ∂ log pθ (x) (with E[ˆg] ≈ ∂θ ) is not reduced to compensate, then Naively, for each gradient update step, one would start a the Gibbs sampler will diverge from the model distribution chain, wait until the chain converges to the and learning will fail. There have been a number of attempts equilibrium distribution and then draw a sufficient number of made to address the failure of Gibbs chain mixing in the samples to approximate the expected gradient with respect context of SML. Desjardins et al. (2010); Cho et al. (2010); to the model (joint) distribution in Eq. 17. Then restart the Salakhutdinov (2010b,a) have all considered various forms of process for the next step of approximate gradient ascent on tempered transitions to improve the mixing rate of the negative the log-likelihood. This procedure has the obvious flaw that phase Gibbs chain. waiting for the Gibbs chain to “burn-in” and reach equilibrium Tieleman and Hinton (2009) have proposed quite a dif- anew for each gradient update cannot form the basis of a prac- ferent approach to addressing potential mixing problems of tical training algorithm. 
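To fix ideas, the following Python/numpy sketch (illustrative; variable names and sizes are arbitrary) implements the factorized conditionals of Eqs. (13)-(14), block Gibbs sampling, and the resulting Monte Carlo estimate of the log-likelihood gradient with respect to W from Eqs. (16)-(17). How the negative-phase chain is initialized, at the training example or from a persistent chain, is precisely what distinguishes the algorithms discussed next.

    # Block Gibbs sampling in a binary RBM and a Monte Carlo gradient estimate.
    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def sample_h_given_x(x, W, d):
        p = sigmoid(x @ W + d)                    # Eq. (13): P(h_i = 1 | x)
        return p, (rng.random(p.shape) < p).astype(float)

    def sample_x_given_h(h, W, b):
        p = sigmoid(h @ W.T + b)                  # Eq. (14): P(x_j = 1 | h)
        return p, (rng.random(p.shape) < p).astype(float)

    def gibbs_steps(x0, W, b, d, k=1):
        x = x0
        for _ in range(k):                        # alternate sampling h | x and x | h
            _, h = sample_h_given_x(x, W, d)
            _, x = sample_x_given_h(h, W, b)
        return x

    def grad_W_estimate(x_data, x_model, W, d):
        """Positive phase minus negative phase (Eq. 16), using P(h=1|x) in place of sampled h."""
        ph_data, _ = sample_h_given_x(x_data, W, d)
        ph_model, _ = sample_h_given_x(x_model, W, d)
        return (x_data.T @ ph_data - x_model.T @ ph_model) / len(x_data)

    d_x, d_h = 20, 10
    W = 0.01 * rng.standard_normal((d_x, d_h))    # W is d_x x d_h, as in the text
    b, d = np.zeros(d_x), np.zeros(d_h)
    x_batch = (rng.random((64, d_x)) < 0.5).astype(float)
    x_neg = gibbs_steps(x_batch, W, b, d, k=1)    # negative-phase samples
    W += 0.1 * grad_W_estimate(x_batch, x_neg, W, d)   # one ascent step on the log-likelihood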
Contrastive divergence (Hinton, 1999; SML with their fast-weights persistent contrastive divergence Hinton et al., 2006), stochastic maximum likelihood (Younes, (FPCD), and it has also been exploited to train Deep Boltz- 1999; Tieleman, 2008) and fast-weights persistent contrastive mann Machines (Salakhutdinov, 2010a) and construct a pure divergence or FPCD (Tieleman and Hinton, 2009) are all sampling algorithm for RBMs (Breuleux et al., 2011). FPCD examples of algorithms that attempt sidestep the need to burn- builds on the surprising but robust tendency of Gibbs chains in the negative phase Markov chain. to mix better during SML learning than when the model parameters are fixed. The phenomena is rooted in the form of 2.3.1 Contrastive Divergence: the likelihood gradient itself (Eq. 16). The samples drawn from Contrastive divergence (CD) estimation (Hinton, 1999; Hinton the SML Gibbs chain are used in the negative phase of the et al., 2006) uses a biased estimate of the gradient in Eq. 16 gradient, which implies that the learning update will slightly by approximating the negative phase expectation with a very increase the energy (decrease the probability) of those samples, short Gibbs chain (often just one step) initialized at the making the region in the neighborhood of those samples training data used in the positive phase. This initialization less likely to be resampled and therefore making it more is chosen to reduce the variance of the negative expectation likely that the samples will move somewhere else (typically based on samples from the short running Gibbs sampler. The going near another mode). Rather than drawing samples from intuition is that, while the samples drawn from very short the distribution of the current model (with parameters θ), Gibbs chains may be a heavily biased (and poor) represen- FPCD exaggerates this effect by drawing samples from a local tation of the model distribution, they are at least moving in perturbation of the model with parameters θ∗ and an update the direction of the model distribution relative to the data specified by: 8

and in terms of computational efficiency (with respect to T ! analytically obtained score matching). Note that denoising ∗ ∗ ∗ ∂ X (t) θt+1 = (1 − η)θt+1 + ηθt +  log p(x ) , (18) score matching is a special case of the denoising auto-encoder ∂θi t=1 training criterion (Section 3.3) when the reconstruction error ∗ where  is the relatively large fast-weight learning rate residual equals a gradient, i.e., the score function associated ∗ ( > ) and 0 < η < 1 (but near 1) is a forgetting factor with an energy function, as shown in (Vincent, 2011). that keeps the perturbed model close to the current model. Unlike tempering, FPCD does not converge to the model ∗ 2.4 Models with Non-Diagonal Conditional Covari- distribution as  and  go to 0, and further work is necessary ance to characterize the nature of its approximation to the model One of the most popular domains of application of unsuper- distribution. Nevertheless, FPCD is a popular and apparently vised feature learning methods are vision tasks, where we seek effective means of drawing approximate samples from the to learn basic building blocks for images and videos which are model distribution that faithfully represent its diversity, at the usually represented by vectors of reals, i.e. x ∈ dx . price of sometimes generating spurious samples in between R two modes (because the fast weights roughly correspond to The Inductive Bias of the Gaussian RBM a smoothed view of the current model’s energy function). It has been applied in a variety of applications (Tieleman and The most straightforward approach – and until recently the Hinton, 2009; Ranzato et al., 2011; Kivinen and Williams, most popular – to modeling this kind of real-valued obser- 2012) and it has been transformed into a pure sampling vations within the RBM framework has been the so-called algorithm (Breuleux et al., 2011) that also shares this fast Gaussian RBM. With the number of hidden units dh = Nm, m N d mixing property with herding (Welling, 2009), for the same h ∈ {0, 1} m and x ∈ R x , the Gaussian RBM model is reason. specified by the energy function: Nm Nm m m 1 X m X m m 2.3.3 Pseudolikelihood and Ratio-matching E (x, h ) = x · x − x · W h − b · h , (19) θ 2 i i i i While CD, SML and FPCD are by far the most popular meth- i=1 i=1 m ods for training RBMs and RBM-based models, all of these where Wi is the weight vector associated with hidden unit hi m methods are perhaps most naturally described as offering dif- and b is a vector of biases. ferent approximations to maximum likelihood training. There The Gaussian RBM is defined such that the conditional exist other inductive principles that are alternatives to maxi- distribution of the visible layer given the hidden layer is a fixed mum likelihood that can also be used to train RBMs. In partic- covariance Gaussian with the conditional parametrized ular, these include pseudo-likelihood (Besag, 1975) and ratio- by the product of a weight matrix and a binary hidden vector: matching (Hyvarinen,¨ 2007). Both of these inductive principles p(x | h) = N (W h, σI) (20) attempt to avoid explicitly dealing with the partition function, P Thus, in considering the marginal p(x) = h p(x | h)p(h), and their asymptotic efficiency has been analyzed (Marlin and the Gaussian RBM can be interpreted as a Gaussian mixture de Freitas, 2011). Pseudo-likelihood seeks to maximize the model with each setting of the hidden units specifying the product of all one-dimensional conditional distributions of the position of a mixture component. 
While the number of mixture form P (xd|x\d), while ratio-matching can be interpreted as an components is potentially very large, growing exponentially in extension of score matching (Hyvarinen,¨ 2005) to discrete data the number of hidden units, capacity is controlled by these types. Both methods amount to weighted differences of the mixture components sharing a relatively small number of gradient of the RBM free energy4 evaluated at a data point and parameters. at all neighboring points within a hamming ball of radius 1. The GRBM has proved somewhat unsatisfactory as a model One drawback of these methods is that the computation of the of natural images, as the trained features typically do not for all neighbors of each training data point require a represent sharp edges that occur at object boundaries and lead significant computational overhead that scales linearly with the to latent representations that are not particularly useful features dimensionality of the input, nd. CD, SML and FPCD have no for classification tasks (Ranzato and Hinton, 2010). Ranzato such issue. Marlin et al. (2010) provides an excellent survey and Hinton (2010) argue that the failure of the GRBM to ade- of these methods and their relation to CD and SML. They quately capture the statistical structure of natural images stems also empirically compared all of these methods on a range of from the exclusive use of the model capacity to capture the classification, reconstruction and density modeling tasks and conditional mean at the expense of the conditional covariance. found that, in general, SML provided the best combination of Natural images, they argue, are chiefly characterized by the overall performance and computational tractability. However, covariance of the pixel values not by their absolute values. in a later study, the same authors (Swersky et al., 2011) This point is supported by the common use of preprocessing found denoising score matching (Kingma and LeCun, 2010; methods that standardize the global scaling of the pixel values Vincent, 2011) to be a competitive inductive principle both across images in a dataset or across the pixel values within in terms of classification performance (with respect to SML) each image. In the remainder of this section we discuss a few alternative 4. The free energy F(x; θ) is defined in relation to the marginal likelihood models that each attempt to take on this objective of better of the data: F(x; θ) = − log P (x) − log Zθ and in the case of the RBM is tractable. modeling conditional covariances. As a group, these methods 9

constitute a significant step forward in our efforts in learning even a moderately large input dimension dx, this leads to useful features of natural image data and other real-valued an impractical computational burden. The solution adopted by data. Ranzato and Hinton (2010) is to avoid direct sampling from the conditional 22 by using hybrid Monte Carlo (Neal, 1993) 2.4.1 The Mean and Covariance RBM to draw samples from the marginal p(x) via the mcRBM free One recently introduced approach to modeling real-valued data energy. is the mean and covariance RBM (mcRBM) (Ranzato and Hin- As a model of real-valued data, the mcRBM has shown ton, 2010). Like the Gaussian RBM, the mcRBM is a 2-layer considerable potential. It has been used in an object classifi- Boltzmann machine that explicitly models the visible units as cation in natural images (Ranzato and Hinton, 2010) as well as Gaussian distributed quantities. However unlike the Gaussian the basis of a highly successful phoneme recognition system RBM, the mcRBM uses its hidden layer to independently (Dahl et al., 2010) whose performance surpassed the previous parametrize both the mean and covariance of the data through state-of-the-art in this domain by a significant margin. Despite two sets of hidden units. The mcRBM is a combination these successes, it seems that due to difficulties in training the of the covariance RBM (cRBM) (Ranzato et al., 2010a), mcRBM, the model is presently being superseded by the mPoT that models the conditional covariance, with the Gaussian model. We discuss this model next. RBM that captures the conditional mean. Specifically, with 2.4.2 Mean - Product of Student’s T-distributions N covariance hidden units, hc ∈ {0, 1}Nc , and N mean c m The product of Student’s T-distributions model (Welling et al., hidden units, hm ∈ {0, 1}Nm and, as always, taking the 2003) is an energy-based model where the conditional distri- dimensionality of the visible units to be d , x ∈ dx , the x R bution over the visible units conditioned on the hidden vari- mcRBM model with d = N + N hidden units is defined h m c ables is a multivariate Gaussian (non-diagonal covariance) and via the energy function5: the complementary conditional distribution over the hidden Nc Nc 2 c m 1 X X c  T  E (x, h , h ) = − h x C variables given the visibles are a set of independent Gamma θ 2 j j j=1 j=1 distributions. The PoT model has recently been generalized to Nc the mPoT model (Ranzato et al., 2010b) to include nonzero X c c m m − bj hi + Eθ (x, h ), (21) Gaussian means by the addition of Gaussian RBM-like hidden j=1 units, similarly to how the mcRBM generalizes the cRBM. Using the same notation as we did when we described the where Cj is the weight vector associated with covariance unit hc and bc is a vector of covariance unit biases. This energy mcRBM above, the mPoT energy function is given as: i mPoT m c function gives rise to a set of conditional distributions over Eθ (x, h , h ) =   2  X c 1  T  c the visible and hidden units. In particular, as desired, the h 1 + C x + (1 − γ ) log h j 2 j j j conditional distribution over the visible units given the mean j and covariance hidden units is a fully general multivariate m m + Eθ (x, h ) (24) Gaussian distribution: Nm ! ! m c X m where, as in the case of the mcRBM, Cj is the weight vector p(x | h , h ) = N Σ Wj h , Σ , (22) c associated with covariance unit hj. 
Also like the mcRBM, j=1 the mPoT model energy function specifies a fully general  c −1 PN T multivariate Gaussian conditional distribution over the inputs with covariance matrix Σ = j=1 hjCjCj + I . The hm given in Eq 22. However, unlike in the mcRBM, the hidden conditional distributions over the binary hidden units i and c + hc form the basis for the feature representation in the mcRBM covariance units are positive, real-valued (h ∈ R ), and i conditionally Gamma-distributed: and are given by:  2 c 1  T  Nm ! p(hj | x) = G γj , 1 + Cj x (25) m X c m 2 P (hi = 1 | x) = sigmoid Wihi − bi , Thus, the mPoT model characterizes the covariance of real- i=1 Nc ! valued inputs with real-valued Gamma-distributed hidden 2 c 1 X c  T  c P (h = 1 | x) = sigmoid h x C − b , units. j 2 j j j j=1 Since the PoT model gives rise to nearly the identical (23) multivariate Gaussian conditional distribution over the input as the mcRBM, estimating the parameters of the mPoT model en- Like other RBM-based models, the mcRBM can be trained counters the same difficulties as encountered with the mcRBM. using either CD or SML (Ranzato and Hinton, 2010). There is, The solution is the same: direct sampling of p(x) via hybrid however, one significant difference: due to the covariance con- Monte Carlo. tributions to the conditional Gaussian distribution in Eq. 22, The mPoT model has been used to synthesize large-scale the sampling from this conditional, which is required as part natural images (Ranzato et al., 2010b) that show large-scale of both CD and SML would require computing the inverse features and shadowing structure. It has been used to model in the expression for Σ at every iteration of learning. With natural textures (Kivinen and Williams, 2012) in a tiled- configuration (see section 8.2) and has also been 5. In the interest of simplicity we have suppressed some details of the mcRBM energy function, please refer to the original exposition in Ranzato used to achieve state-of-the-art performance on a facial ex- and Hinton (2010). pression recognition task (Ranzato et al., 2011). 10

2.4.3 The Spike-and-Slab RBM images and has performed well in the role (Courville et al., Another recently introduced RBM-based model with the ob- 2011a,b). When trained convolutionally (see Section 8.2) on jective of having the hidden units encode both the mean full CIFAR-10 natural images, the model demonstrated the and covariance information is the spike-and-slab Restricted ability to generate natural image samples that seem to capture Boltzmann Machine (ssRBM) (Courville et al., 2011a,b). The the broad statistical structure of natural images, as illustrated ssRBM is defined as having both a real-valued “slab” variable with the samples of Figure 1. and a binary “spike” variable associated with each unit in the hidden layer. In structure, it can be thought of as augmenting each hidden unit of the standard Gaussian RBM with a real- valued variable. More specifically, the i-th hidden unit (where 1 ≤ i ≤ dh) is associated with a binary spike variable: hi ∈ {0, 1} and a 6 real-valued variable si ∈ R. The ssRBM energy function is given as:

E_\theta(x, s, h) = -\sum_{i=1}^{d_h} x^T W_i s_i h_i + \frac{1}{2} x^T \Lambda x

dh dh dh dh 1 X 2 X X X 2 + α s − α µ s h − b h + α µ h , (26) 2 i i i i i i i i i i i i=1 i=1 i=1 i=1 where Wi refers to the ith weight matrix of size dx × dh, the bi are the biases associated with each of the spike variables hi, and αi and Λ are diagonal matrices that penalize large values 2 2 of ksik2 and kvk2 respectively. The distribution p(x | h) is determined by analytically marginalizing over the s variables. d ! Xh p(x | h) = N Covv|h Wiµihi , Covx|h (27) i=1 Fig. 1. (Top) Samples from a convolutionally trained µ-ssRBM,  −1 see details in Courville et al. (2011b). (Bottom) The images in Pdh −1 T where Covx|h = Λ − i=1 αi hiWiWi , the last equality the CIFAR-10 training set closest (L2 distance with contrast nor- holds only if the covariance matrix Covx|h is positive definite. malized training images) to the corresponding model samples. The model does not appear to be capturing the natural image Strategies for ensuring positive definite Covx|h are discussed et al. statistical structure by overfitting particular examples from the in Courville (2011b). Like the mcRBM and the mPoT dataset. model, the ssRBM gives rise to a fully general multivariate Gaussian conditional distribution p(x | h). Crucially, the ssRBM has the property that while the con- 2.4.4 Comparing the mcRBM, mPoT and ssRBM ditional p(x | h) does not easily factor, all the other relevant The mcRBM, mPoT and ssRBM each set out to model conditionals do, with components given by: real-valued data such that the hidden units encode not only  −1 T  −1 the conditional mean of the data but also its conditional p(si | x, h) = N α x Wi + µi hi , α , i i covariance. The most obvious difference between these models   1 −1 T 2 T is the natural of the sampling scheme used in training them. p(hi = 1 | x) = sigmoid αi (x Wi) + x Wiµi + bi , 2 As previously discussed, while both the mcRBM and mPoT dh ! −1 X −1 models resort to hybrid Monte Carlo, the design of the ssRBM p(x | s, h) = N Λ W s h , Λ i i i admits a simple and efficient Gibbs sampling scheme. It i=1 remains to be determined if this difference impacts the relative In training the ssRBM, these factored conditionals are ex- feasibility of the models. ploited to use a 3-phase block Gibbs sampler as an inner loop A somewhat more subtle difference between these models is to either CD or SML. Thus unlike the mcRBM or the mPoT how they encode their conditional covariance. Despite signifi- alternatives, the ssRBM can make use of efficient and simple cant differences in the expression of their energy functions, Gibbs sampling during training and inference, and does not the mcRBM and the mPoT (Eq. 21 versus Eq. 24), they need to resort to hybrid Monte Carlo (which has extra hyper- are very similar in how they model the covariance structure parameters). of the data, in both cases conditional covariance is given The ssRBM has been demonstrated as a feature learning  c −1 by PN hcC CT + I . Both models use the activation and extraction scheme in the context of CIFAR-10 object j=1 j j j classification (Krizhevsky and Hinton, 2009) from natural of the hidden units hj > 0 to enforces constraints on the covariance of x, in the direction of Cj. The ssRBM, on the 6. The ssRBM can be easily generalized to having a vector of slab variables other hand, specifies the conditional covariance of p(x | h) associated with each spike variable(Courville et al., 2011a). For simplicity of  −1 Pdh −1 T exposition we will assume a scalar si. as Λ − i=1 αi hiWiWi and uses the hidden spike 11

In the complete case, when the dimensionality of the hidden layer equals that of the input, these two ways to specify the conditional covariance are roughly equivalent. However, they diverge when the dimensionality of the hidden layer is significantly different from that of the input. In the over-complete setting, sparse activation with the ssRBM parametrization permits significant variance (above the nominal variance given by \Lambda^{-1}) only in the select directions of the sparsely activated h_i. This is a property the ssRBM shares with sparse coding models (Olshausen and Field, 1997; Grosse et al., 2007), where the sparse latent representation also encodes directions of variance above a nominal value. In the case of the mPoT or mcRBM, an over-complete set of constraints on the covariance implies that capturing arbitrary covariance along a particular direction of the input requires decreasing potentially all constraints with positive projection in that direction. This perspective suggests that the mPoT and mcRBM are not well suited to providing a sparse representation in the over-complete setting.

3 REGULARIZED AUTO-ENCODERS
Within the framework of probabilistic models adopted in Section 2, features are always associated with latent variables, specifically with their posterior distribution given an observed input x. Unfortunately, this posterior distribution tends to become very complicated and intractable if the model has more than a couple of interconnected layers, whether in the directed or undirected graphical model frameworks. It then becomes necessary to resort to sampling or approximate inference techniques, and to pay the associated computational and approximation error price. This is in addition to the difficulties raised by the intractable partition function in undirected graphical models. Moreover, a posterior distribution over latent variables is not yet a simple usable feature vector that can, for example, be fed to a classifier. So actual feature values are typically derived from that distribution, taking the latent variable's expectation (as is typically done with RBMs) or finding their most likely value (as in sparse coding). If we are to extract stable deterministic numerical feature values in the end anyway, an alternative (apparently) non-probabilistic feature learning paradigm that focuses on carrying out this part of the computation very efficiently is that of auto-encoders.

3.1 Auto-Encoders
In the auto-encoder framework (LeCun, 1987; Hinton and Zemel, 1994), one starts by explicitly defining a feature-extracting function in a specific parametrized closed form. This function, which we will denote f_\theta, is called the encoder and allows the straightforward and efficient computation of a feature vector h = f_\theta(x) from an input x. For each example x^{(t)} from a data set \{x^{(1)}, \ldots, x^{(T)}\}, we define

h^{(t)} = f_\theta(x^{(t)})   (28)

where h^{(t)} is the feature vector or representation or code computed from x^{(t)}. Another closed form parametrized function g_\theta, called the decoder, maps from feature space back into input space, producing a reconstruction r = g_\theta(h). The set of parameters \theta of the encoder and decoder is learned simultaneously on the task of reconstructing the original input as well as possible, i.e. attempting to incur the lowest possible reconstruction error L(x, r) (a measure of the discrepancy between x and its reconstruction) on average over a training set.

In summary, basic auto-encoder training consists in finding a value of parameter vector \theta minimizing reconstruction error

J_{AE}(\theta) = \sum_t L(x^{(t)}, g_\theta(f_\theta(x^{(t)}))).   (29)

This minimization is usually carried out by stochastic gradient descent, as in the training of Multi-Layer Perceptrons (MLPs). Since auto-encoders were primarily developed as MLPs predicting their input, the most commonly used forms for the encoder and decoder are affine mappings, optionally followed by a non-linearity:

f_\theta(x) = s_f(b + W x)   (30)
g_\theta(h) = s_g(d + W' h)   (31)

where s_f and s_g are the encoder and decoder activation functions (typically the element-wise sigmoid or hyperbolic tangent non-linearity, or the identity function if staying linear). The set of parameters of such a model is \theta = \{W, b, W', d\}, where b and d are called the encoder and decoder bias vectors, and W and W' are the encoder and decoder weight matrices.

The choice of s_g and L depends largely on the input domain range. A natural choice for an unbounded domain is a linear decoder with a squared reconstruction error, i.e. s_g(a) = a and L(x, r) = \|x - r\|^2. If inputs are bounded between 0 and 1 however, ensuring a similarly-bounded reconstruction can be achieved by using s_g = sigmoid. In addition, if the inputs are of a binary nature, a binary cross-entropy loss^7 is sometimes used.

In the case of a linear auto-encoder (linear encoder and decoder) with squared reconstruction error, the basic auto-encoder objective in Equation 29 is known to learn the same subspace^8 as PCA. This is also true when using a sigmoid nonlinearity in the encoder (Bourlard and Kamp, 1988), but not if the weights W and W' are tied (W' = W^T).

Similarly, Le et al. (2011b) recently showed that adding a regularization term of the form \sum_i \sum_j s_3(W_j x^{(i)}) to a linear auto-encoder with tied weights, where s_3 is a nonlinear convex function, yields an efficient algorithm for learning linear ICA.

If both encoder and decoder use a sigmoid non-linearity, then f_\theta(x) and g_\theta(h) have the exact same form as the conditionals P(h | v) and P(v | h) of binary RBMs (see Section 2.2.1). This similarity motivated an initial study (Bengio et al., 2007) of the possibility of replacing RBMs with auto-encoders as the basic pre-training strategy for building deep networks, as well as the comparative analysis of the auto-encoder reconstruction error gradient and contrastive divergence updates (Bengio and Delalleau, 2009).

7. L(x, r) = -\sum_{i=1}^{d_x} x_i \log(r_i) + (1 - x_i) \log(1 - r_i)
8. Contrary to traditional PCA loading factors, but similarly to the parameters learned by probabilistic PCA, the weight vectors learned by such an auto-encoder are not constrained to form an orthonormal basis, nor to have a meaningful ordering. They will however span the same subspace.
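To make Equations 29-31 concrete, the following minimal Python sketch trains a basic auto-encoder with an affine sigmoid encoder, an affine linear decoder and squared reconstruction error by plain stochastic gradient descent. All sizes, the learning rate and the toy data are illustrative assumptions, not recommended settings.

import numpy as np

# Basic auto-encoder (Eqs. 29-31): sigmoid encoder, linear decoder, squared error.
rng = np.random.default_rng(0)
d_x, d_h, lr = 20, 10, 0.1
X = rng.random((500, d_x))                    # toy training set, inputs in [0, 1]

W = 0.1 * rng.standard_normal((d_h, d_x))     # encoder weights
b = np.zeros(d_h)                             # encoder bias
Wp = 0.1 * rng.standard_normal((d_x, d_h))    # decoder weights W'
d = np.zeros(d_x)                             # decoder bias
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

for epoch in range(20):
    for x in X:
        h = sigmoid(b + W @ x)                # f_theta(x), Eq. 30
        r = d + Wp @ h                        # g_theta(h), Eq. 31 (linear decoder)
        err = r - x                           # gradient of 0.5 * ||x - r||^2 w.r.t. r
        # Backpropagate through the decoder, then the encoder
        grad_Wp = np.outer(err, h)
        dh = (Wp.T @ err) * h * (1.0 - h)     # through the sigmoid
        grad_W = np.outer(dh, x)
        # Stochastic gradient step on the reconstruction error (Eq. 29)
        Wp -= lr * grad_Wp; d -= lr * err
        W -= lr * grad_W;  b -= lr * dh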

One notable difference in the parametrization is that RBMs use a single weight matrix, which follows naturally from their energy function, whereas the auto-encoder framework allows for a different matrix in the encoder and decoder. In practice, however, weight-tying, in which one defines W' = W^T, may be (and most often is) used, rendering the parametrizations identical. The usual training procedures nevertheless differ greatly between the two approaches. A practical advantage of training auto-encoder variants is that they define a simple tractable optimization objective that can be used to monitor progress.

Traditionally, auto-encoders, like PCA, were primarily seen as a dimensionality reduction technique and thus used a bottleneck, i.e. d_h < d_x. But successful uses of sparse coding and RBM approaches tend to favor learning over-complete representations, i.e. d_h > d_x. This can render the auto-encoding problem too simple (e.g. simply duplicating the input in the features may allow perfect reconstruction without having extracted any more meaningful feature). Thus alternative ways to "constrain" the representation, other than constraining its dimensionality, have been investigated. We broadly refer to these alternatives as "regularized" auto-encoders. The effect of a bottleneck or of these regularization terms is that the auto-encoder cannot reconstruct everything well: it is trained to reconstruct the training examples well, and generalization means that reconstruction error is also small on test examples. An interesting justification (Ranzato et al., 2008) for the sparsity penalty (or any penalty that restricts in a soft way the volume of hidden configurations easily accessible by the learner) is that it acts in spirit like the partition function of RBMs, by making sure that only few input configurations can have a low reconstruction error. See Section 4 for a longer discussion on the lack of a partition function in auto-encoder training criteria.

3.2 Sparse Auto-Encoders
The earliest use of single-layer auto-encoders for building deep architectures by stacking them (Bengio et al., 2007) considered the idea of tying the encoder weights and decoder weights to restrict capacity, as well as the idea of introducing a form of sparsity regularization (Ranzato et al., 2007). Several ways of introducing sparsity in the representation learned by auto-encoders have since been proposed, some by penalizing the hidden unit biases (making these additive offset parameters more negative) (Ranzato et al., 2007; Lee et al., 2008; Goodfellow et al., 2009; Larochelle and Bengio, 2008) and some by directly penalizing the output of the hidden unit activations (making them closer to their saturating value at 0) (Ranzato et al., 2008; Le et al., 2011a; Zou et al., 2011). Note that penalizing the bias runs the danger that the weights could compensate for the bias, which could hurt the numerical optimization of parameters. When directly penalizing the hidden unit outputs, several variants can be found in the literature, but no clear comparative analysis has been published to evaluate which one works better. Although the L1 penalty (i.e., simply the sum of output elements h_j in the case of a sigmoid non-linearity) would seem the most natural (because of its use in sparse coding), it is used in few papers involving sparse auto-encoders. A close cousin of the L1 penalty is the Student-t penalty, \log(1 + h_j^2), originally proposed for sparse coding (Olshausen and Field, 1997). Several papers penalize the average output \bar{h}_j (e.g. over a minibatch), and instead of pushing it to 0, encourage it to approach a fixed target, either through a mean-square error penalty or, maybe more sensibly (because \bar{h}_j behaves like a probability), a Kullback-Leibler divergence with respect to the binomial distribution with probability \rho: -\rho \log \bar{h}_j - (1 - \rho) \log(1 - \bar{h}_j) + \text{constant}, e.g., with \rho = 0.05.

3.3 Denoising Auto-Encoders
Vincent et al. (2008, 2010) proposed altering the training objective in Equation 29 from mere reconstruction to that of denoising an artificially corrupted input, i.e. learning to reconstruct the clean input from a corrupted version. Learning the identity is no longer enough: the learner must capture the structure of the input distribution in order to optimally undo the effect of the corruption process, with the reconstruction essentially being a nearby but higher density point than the corrupted input.

Formally, the objective optimized by such a Denoising Auto-Encoder (DAE) is:

J_{DAE} = \sum_t E_{q(\tilde{x}|x^{(t)})}\left[ L(x^{(t)}, g_\theta(f_\theta(\tilde{x}))) \right]   (32)

where E_{q(\tilde{x}|x^{(t)})}[\cdot] denotes the expectation over corrupted examples \tilde{x} drawn from the corruption process q(\tilde{x} | x^{(t)}). In practice this is optimized by stochastic gradient descent, where the stochastic gradient is estimated by drawing one or a few corrupted versions of x^{(t)} each time x^{(t)} is considered. Corruptions considered in Vincent et al. (2010) include additive isotropic Gaussian noise, salt and pepper noise for gray-scale images, and masking noise (salt or pepper only). Qualitatively better features are reported, resulting in improved classification performance, compared to basic auto-encoders, and similar or better than that obtained with RBMs.

The analysis in Vincent (2011) relates the denoising auto-encoder criterion to energy-based probabilistic models: denoising auto-encoders basically learn in r(\tilde{x}) - \tilde{x} a vector pointing in the direction of the estimated score, i.e., \partial \log p(\tilde{x}) / \partial \tilde{x}. In the special case of linear reconstruction and squared error, Vincent (2011) shows that DAE training amounts to learning an energy-based model whose energy function is very close to that of a Gaussian RBM, using a regularized variant of the score matching parameter estimation technique (Hyvärinen, 2005; Hyvärinen, 2008; Kingma and LeCun, 2010) termed denoising score matching (Vincent, 2011). Previously, Swersky (2010) had shown that training Gaussian RBMs with score matching was equivalent to training a regular (non-denoising) auto-encoder with an additional regularization term, while, following up on the theoretical results in Vincent (2011), Swersky et al. (2011) showed the practical advantage of the denoising criterion to implement score matching efficiently.
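The denoising criterion of Eq. 32 is easy to instantiate. The following minimal Python sketch uses masking noise (a fraction of input components set to 0) and a tied-weight sigmoid auto-encoder with the cross-entropy loss of footnote 7; the sizes, corruption level and tied-weight choice are illustrative assumptions, not choices prescribed by Vincent et al. (2010).

import numpy as np

# Denoising auto-encoder sketch: corrupt the input, reconstruct the clean input.
rng = np.random.default_rng(0)
d_x, d_h, nu, lr = 20, 50, 0.25, 0.1               # over-complete code, 25% masking
X = (rng.random((500, d_x)) > 0.5).astype(float)   # toy binary inputs

W = 0.1 * rng.standard_normal((d_h, d_x))
b, d = np.zeros(d_h), np.zeros(d_x)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

for epoch in range(10):
    for x in X:
        x_tilde = x * (rng.random(d_x) > nu)       # one draw from q(x_tilde | x)
        h = sigmoid(b + W @ x_tilde)               # encode the corrupted input
        r = sigmoid(d + W.T @ h)                   # tied-weight sigmoid decoder
        # Cross-entropy loss measured against the *clean* x; err is its gradient
        # with respect to the decoder pre-activation.
        err = r - x
        dh = (W @ err) * h * (1.0 - h)
        W -= lr * (np.outer(dh, x_tilde) + np.outer(err, h).T)
        b -= lr * dh
        d -= lr * err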

3.4 Contractive Auto-Encoders
Contractive Auto-Encoders (CAE), proposed by Rifai et al. (2011a), follow up on Denoising Auto-Encoders (DAE) and share a similar motivation of learning robust representations. CAEs achieve this by adding an analytic contractive penalty term to the basic auto-encoder of Equation 29. This term is the Frobenius norm of the encoder's Jacobian, and results in penalizing the sensitivity of the learned features to infinitesimal changes of the input.

Let J(x) = \partial f_\theta(x) / \partial x be the Jacobian matrix of the encoder evaluated at x. The CAE's training objective is the following:

J_{CAE} = \sum_t \left( L(x^{(t)}, g_\theta(f_\theta(x^{(t)}))) + \lambda \left\| J(x^{(t)}) \right\|_F^2 \right)   (33)

where \lambda is a hyper-parameter controlling the strength of the regularization.

For an affine sigmoid encoder, the contractive penalty term is easy to compute:

J_j(x) = f_\theta(x)_j (1 - f_\theta(x)_j) W_j
\left\| J(x^{(t)}) \right\|_F^2 = \sum_j \left( f_\theta(x)_j (1 - f_\theta(x)_j) \right)^2 \|W_j\|^2   (34)

There are at least three notable differences with DAEs, which may be partly responsible for the better performance that CAE features seem to empirically demonstrate: a) the sensitivity of the features is penalized directly^9 rather than the sensitivity of the reconstruction; b) the penalty is analytic rather than stochastic: an efficiently computable expression replaces what might otherwise require d_x corrupted samples to size up (i.e. the sensitivity in d_x directions); c) a hyper-parameter \lambda allows a fine control of the trade-off between reconstruction and robustness (while the two are mingled in a DAE).

A potential disadvantage of the CAE's analytic penalty is that it amounts to only encouraging robustness to infinitesimal changes of the input. This is remedied by a further extension proposed in Rifai et al. (2011b) and termed CAE+H, which penalizes all higher order derivatives in an efficient stochastic manner, by adding a third term that encourages J(x) and J(x + \epsilon) to be close:

J_{CAE+H} = \sum_t \left( L(x^{(t)}, g_\theta(f_\theta(x^{(t)}))) + \lambda \left\| J(x^{(t)}) \right\|_F^2 \right) + \gamma\, E_\epsilon\left[ \| J(x) - J(x + \epsilon) \|_F^2 \right]   (35)

where \epsilon \sim N(0, \sigma^2 I), and \gamma is the associated regularization strength hyper-parameter. As for the DAE, the training criterion is optimized by stochastic gradient descent, whereby the expectation is approximated by drawing several corrupted versions of x^{(t)}.

Note that the DAE and CAE have been successfully used to win the final phase of the Unsupervised and Transfer Learning Challenge (Mesnil et al., 2011). Note also that the representation learned by the CAE tends to be saturated rather than sparse, i.e., most of the hidden units are near the extremes of their range (e.g. 0 or 1), and their derivative \partial h_i(x) / \partial x is tiny. The non-saturated units are few and sensitive to the inputs, with their associated filters (hidden unit weight vectors) together forming a basis explaining the local changes around x, as discussed in Section 5.3. Another way to get saturated (i.e. nearly binary) units (for the purpose of hashing) is semantic hashing (Salakhutdinov and Hinton, 2007).

9. i.e., the robustness of the representation is encouraged.
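Because the penalty of Eq. 34 has a closed form for an affine sigmoid encoder, it can be computed without building the full Jacobian. A minimal Python sketch (sizes and initialization are illustrative assumptions) is:

import numpy as np

# Contractive penalty of Eq. 34: ||J(x)||_F^2 = sum_j (h_j (1 - h_j))^2 ||W_j||^2
# for h = sigmoid(b + W x), where W_j is the j-th row of W.
rng = np.random.default_rng(0)
d_x, d_h = 20, 50
W = 0.1 * rng.standard_normal((d_h, d_x))
b = np.zeros(d_h)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def contractive_penalty(x):
    h = sigmoid(b + W @ x)
    return np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1))

# Sanity check against the explicit Jacobian J_ij = h_i (1 - h_i) W_ij
x = rng.standard_normal(d_x)
h = sigmoid(b + W @ x)
J = (h * (1.0 - h))[:, None] * W
assert np.allclose(contractive_penalty(x), np.sum(J ** 2))

In a CAE this quantity, weighted by \lambda, is simply added to the reconstruction loss of Eq. 29 before the gradient step.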
3.5 Predictive Sparse Decomposition
Sparse coding (Olshausen and Field, 1997) may be viewed as a kind of auto-encoder that uses a linear decoder with squared reconstruction error, but whose non-parametric encoder f_\theta performs the comparatively non-trivial and relatively costly minimization of Equation 2, which entails an iterative optimization.

A practically successful variant of sparse coding and auto-encoders, named Predictive Sparse Decomposition or PSD (Kavukcuoglu et al., 2008), replaces that costly encoding step by a fast non-iterative approximation during recognition (computing the learned features). PSD has been applied to object recognition in images and video (Kavukcuoglu et al., 2009, 2010; Jarrett et al., 2009; Farabet et al., 2011), but also to audio (Henaff et al., 2011), mostly within the framework of multi-stage convolutional and hierarchical architectures (see Section 8.2). The main idea can be summarized by the following equation for the training criterion, which is simultaneously optimized with respect to the hidden codes (representation) h^{(t)} and with respect to the parameters (W, \alpha):

J_{PSD} = \sum_t \left( \lambda \|h^{(t)}\|_1 + \|x^{(t)} - W h^{(t)}\|_2^2 + \|h^{(t)} - f_\alpha(x^{(t)})\|_2^2 \right)   (36)

where x^{(t)} is the input vector for example t, h^{(t)} is the optimized hidden code for that example, and f_\alpha(\cdot) is the encoding function, the simplest variant being

f_\alpha(x^{(t)}) = \tanh(b + W^T x^{(t)})   (37)

where the encoding weights are the transpose of the decoding weights, but many other variants have been proposed, including the use of a shrinkage operation instead of the hyperbolic tangent (Kavukcuoglu et al., 2010). Note how the L1 penalty on h tends to make the codes sparse, and notice that this is the same criterion as sparse coding with dictionary learning (Eq. 3), except for the additional constraint that one should be able to approximate the sparse codes h with a parametrized encoder f_\alpha(x). One can thus view PSD as an approximation to sparse coding, where we obtain a fast approximate encoding process as a side effect of training. In practice, once PSD is trained, object representations used to feed a classifier are computed from f_\alpha(x), which is very fast, and can then be further optimized (since the encoder can be viewed as one stage or one layer of a trainable multi-stage system such as a feedforward neural network).

PSD can also be seen as a kind of auto-encoder (there is an encoder f_\alpha(\cdot) and a decoder W) where, instead of being tied to the output of the encoder, the codes h are given some freedom that can help to further improve reconstruction. One can also view the encoding penalty added on top of sparse coding as a kind of regularizer that forces the sparse codes to be nearly computable by a smooth and efficient encoder. This is in contrast with the codes obtained by complete optimization of the sparse coding criterion, which are highly non-smooth or even non-differentiable, a problem that motivated other approaches to smooth the inferred codes of sparse coding (Bagnell and Bradley, 2009), so that a sparse coding stage could be jointly optimized along with following stages of a deep architecture.
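To illustrate how the free codes and the parameters interact in Eq. 36, the following minimal Python sketch performs alternating gradient steps on a single example. The plain gradient updates, the smooth surrogate used for the L1 term and all sizes are simplifying assumptions for illustration; they are not the optimization procedure of Kavukcuoglu et al. (2008).

import numpy as np

# PSD criterion (Eq. 36) for one example: code h is a free variable optimized
# jointly with the dictionary W and the fast encoder f_alpha of Eq. 37.
rng = np.random.default_rng(0)
d_x, d_h, lam, lr = 20, 50, 0.5, 0.05
W = 0.1 * rng.standard_normal((d_x, d_h))     # linear decoder / dictionary
b = np.zeros(d_h)                             # encoder bias
x = rng.standard_normal(d_x)                  # one training input
h = np.zeros(d_h)                             # free code for this example

def encoder(x):
    return np.tanh(b + W.T @ x)               # f_alpha(x), Eq. 37 (tied weights)

for step in range(200):
    f = encoder(x)
    recon_err = W @ h - x
    # Gradient of Eq. 36 w.r.t. the free code h (|.| smoothed by a tanh surrogate)
    grad_h = lam * np.tanh(10.0 * h) + 2.0 * (W.T @ recon_err) + 2.0 * (h - f)
    h -= lr * grad_h
    # Gradient w.r.t. W: reconstruction term plus encoder-prediction term
    dpre = -2.0 * (h - f) * (1.0 - f ** 2)    # through the tanh of the encoder
    W -= lr * (2.0 * np.outer(recon_err, h) + np.outer(x, dpre))
    b -= lr * dpre

At test time only the cheap forward pass encoder(x) is used to compute features, which is the point of the "predictive" part of PSD.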

3.6 Deep Auto-Encoders
The auto-encoders we have mentioned thus far typically use simple encoder and decoder functions, like those in Eq. 30 and 31. They are essentially MLPs with a single hidden layer, to be used as bricks for building deeper networks of various kinds. Techniques for successfully training deep auto-encoders (Hinton and Salakhutdinov, 2006; Jain and Seung, 2008; Martens, 2010) will be discussed in Section 6.

4 JOINT ENERGY IN PARAMETERS AND DATA (JEPADA)
We propose here a novel way to interpret training criteria such as PSD (Eq. 36) and that of sparse auto-encoders (Section 3.2). We claim that they minimize a Joint Energy in the PArameters and DAta (JEPADA). Note how, unlike for probabilistic models (Section 2), there does not seem to be in these training criteria a partition function, i.e., a function of the parameters only that needs to be minimized, and that involves a sum over all the possible configurations of the observed input x. How is it possible that these learning algorithms work (and quite well indeed), capture crucial characteristics of the input distribution, and yet do not require the explicit minimization of a normalizing partition function? Is there nonetheless a probabilistic interpretation to such training criteria? These are the questions we consider in this section.

Many training criteria used in machine learning algorithms can be interpreted as a regularized log-likelihood, which is decomposed into a straight likelihood term \log P(\text{data} | \theta) (where \theta includes all parameters) and a prior or regularization term \log P(\theta):

J = -\log P(\text{data}, \theta) = -\log P(\text{data} | \theta) - \log P(\theta)

The partition function Z_\theta comes up in the likelihood term, when P(\text{data} | \theta) is expressed in terms of an energy function,

P(\text{data} | \theta) = \frac{e^{-E_\theta(\text{data})}}{\sum_{\text{data}} e^{-E_\theta(\text{data})}} = \frac{e^{-E_\theta(\text{data})}}{Z_\theta},

where E_\theta(\text{data}) is an energy function in terms of the data, parametrized by \theta. Instead, a JEPADA training criterion can be interpreted as follows:

J = -\log P(\text{data}, \theta) = -\log \frac{e^{-E(\text{data}, \theta)}}{Z} = -\log \frac{e^{-E(\text{data}, \theta)}}{\sum_{\text{data}, \theta} e^{-E(\text{data}, \theta)}}   (38)

where E(\text{data}, \theta) should be seen as an energy function jointly in terms of the data and the parameters, and the normalization constant Z is independent of the parameters because it is obtained by marginalizing over both data and parameters. Very importantly, note that in this formulation the gradient of the joint log-likelihood with respect to \theta does not involve the gradient of Z, because Z only depends on the structural form of the energy function.

The regularized log-likelihood view can be seen as a directed model involving the random variables data and \theta, with a directed arc from \theta to data. Instead, JEPADA criteria correspond to an undirected graphical model between data and \theta. This however raises an interesting question. In the directed regularized log-likelihood framework, there is a natural way to generalize from one dataset (e.g. the training set) to another (e.g. the test set) which may be of a different size. Indeed, we assume that the same probability model can be applied for any dataset size, and this comes out automatically, for example, from the usual i.i.d. assumption on the examples. In the JEPADA, note that Z does not depend on the parameters but it does depend on the number of examples in the data. If we want to apply the \theta learned on a dataset of size n_1 to a dataset of size n_2, we need to make an explicit assumption that the same form of the energy function (up to the number of examples) can be applied, with the same parameters, to any data. This is equivalent to stating that there is a family of probability distributions indexed by the dataset size, but sharing the same parameters. It makes sense so long as the number of parameters is not a function of the number of examples (or is viewed as a hyper-parameter that is selected outside of this framework). This is similar to the kind of parameter-tying assumptions made, for example, in the very successful RBM used for collaborative filtering in the Netflix competition (Salakhutdinov et al., 2007).

JEPADA can be interpreted in a Bayesian way, since we are now forced to consider the parameters as a random variable, although in the current practice of training criteria that correspond to a JEPADA, the parameters are optimized rather than sampled from their posterior.

In PSD, there is an extra interesting complication: there is also an h^{(t)} associated with each example, and the training criterion involves them and is optimized with respect to them. In the regularized log-likelihood framework, this is interpreted as approximating the marginalization of the hidden variable h^{(t)} (the correct thing to do according to Bayes' rule) by a maximization (the MAP or Maximum A Posteriori). When we interpret PSD in the JEPADA framework, we do not need to consider that the MAP inference (or an approximate MAP) is an approximation of something else. We can consider that the joint energy function is equal to a minimization (or even an approximate minimization!) over some latent variables.

The final note on JEPADA regards the first question we asked: why does it work? Or rather, when does it work? To make sense of this question, first note that of course the regularized log-likelihood framework can be seen as a special case of JEPADA where \log Z_\theta is one of the terms of the joint energy function. Then note that if we take an ordinary energy function, such as the energy function of an RBM, and minimize it without the bothersome \log Z_\theta term that goes with it, we may get a useless model: all hidden units do the same thing because there are no interactions between them except through Z_\theta. Instead, when a reconstruction error is involved (as in PSD and sparse auto-encoders), the hidden units must cooperate to reconstruct the input. Ranzato et al. (2008) already proposed an interesting explanation as to why minimizing reconstruction error plus sparsity (but no partition function) is reasonable: the sparsity constraint (or other constraints on the capacity of the hidden representation) prevents the reconstruction error (which is the main term in the energy) from being low for every input configuration. It thus

acts in a way that is similar to a partition function, pushing up for each, whereby its dM tangent directions are derived the reconstruction error of every input configuration, whereas from the neighboring training examples (e.g. by performing the minimization of reconstruction error pushes it down at a local PCA). This is most explicitly seen in Manifold the training examples. A similar phenomenon can be seen Parzen Windows (Vincent and Bengio, 2003) and manifold at work in denoising auto-encoders and contractive auto- Charting (Brand, 2003). Well-known manifold learning (“em- encoders. An interesting question is then the following: what bedding”) algorithms include Kernel PCA (Scholkopf¨ et al., are the conditions which give rise to a “useful” JEPADA 1998), LLE (Roweis and Saul, 2000), Isomap (Tenenbaum (which captures well the underlying data distribution), by et al., 2000), Laplacian Eigenmap (Belkin and Niyogi, 2003), opposition to a trivial one (e.g., leading to all hidden units Hessian Eigenmaps (Donoho and Grimes, 2003), Semidefinite doing the same thing, or all input configurations getting a Embedding (Weinberger and Saul, 2004), SNE (Hinton and similar energy). Clearly, a sufficient condition (probably not Roweis, 2003) and t-SNE (van der Maaten and Hinton, 2008) necessary) is that integrating over examples only yields a that were primarily developed and used for data constant in θ (a condition that is satisfied in the traditional through dimensionality reduction. These algorithms optimize directed log-likelihood framework). the hidden representation {h(1), . . . , h(T )}, with each h(t) in Rdh , associated with training points {x(1), . . . , x(T )}, with EPRESENTATION EARNINGAS ANI (t) dx 5 R L M - each x in R , and where dh < dx) in order to best preserve FOLD LEARNING certain properties of an input-space neighborhood graph. This Another important perspective on feature learning is based graph is typically derived from pairwise Euclidean distance (i) (j) on the geometric notion of manifold. Its premise is the relationships Dij = kx − x k2. These methods however manifold hypothesis (Cayton, 2005; Narayanan and Mitter, do not learn a feature extraction function fθ(x) applicable 2010) according to which real-world data presented in high to new test points, which precludes their direct use within a dimensional spaces are likely to concentrate in the vicinity of classifier, except in a transductive setting. For some of these a non-linear manifold M of much lower dimensionality dM, techniques, representations for new points can be computed embedded in high dimensional input space Rdx . The primary using the Nystrom¨ approximation (Bengio et al., 2004) but unsupervised learning task is then seen as modeling the this remains computationally expensive. structure of the data manifold10. The associated representation being learned can be regarded as an intrinsic coordinate system Learning a parametrized mapping based on the neighbor- on the embedded manifold, that uniquely locates an input hood graph point’s projection on the manifold. It is possible to use similar pairwise distance relationships, 5.1 Linear manifold learned by PCA but to directly learn a parametrized mapping fθ that will be PCA may here again serve as a basic example, as it was applicable to new points. 
In early work in this direction (Ben- initially devised by Pearson (1901) precisely with the objective gio et al., 2006b), a parametrized function fθ (an MLP) was of finding the closest linear sub-manifold (specifically a line or trained to predict the tangent space associated to any given a plane) to a cloud of data points. PCA finds a set of vectors point x. Compared to local non-parametric methods, the more dx reduced and tightly controlled number of free parameters force {W1,...,Wdh } in R that span a dM = dh-dimensional linear manifold (a linear subspace of Rdx ). The representation such models to generalize the manifold shape non-locally. The h = W (x − µ) that PCA yields for an input point x uniquely Semi-Supervised Embedding approach of Weston et al. (2008), locates its projection on that manifold: it corresponds to builds a deep parametrized neural that intrinsic coordinates on the manifold. Probabilistic PCA, or simultaneously learns a manifold embedding and a classifier. a linear auto-encoder with squared reconstruction error will While optimizing the supervised the classification cost, the learn the same linear manifold as traditional PCA but are likely training criterion also uses trainset-neighbors of each training to find a different coordinate system for it. We now turn to example to encourage intermediate layers of representation to modeling non-linear manifolds. be invariant when changing the training example for a neigh- bor. Also efficient parametrized extensions of non-parametric 5.2 Modeling a non-linear manifold from pairwise manifold learning techniques, such as parametric t-SNE (van distances der Maaten, 2009), could similarly be used for unsupervised Local non-parametric methods based on the neighbor- feature learning. hood graph Basing the modeling of manifolds on trainset nearest neigh- A common approach for modeling a dM-dimensional non- bors might however be risky statistically in high dimensional linear manifold is as a patchwork of locally linear pieces. spaces (sparsely populated due to the curse of dimensionality) Thus several methods explicitly parametrize the tangent space as nearest neighbors risk having little in common. It can also around each training point using a separate set of parameters become problematic computationally, as it requires consider- ing all pairs of data points11, which scales quadratically with 10. What is meant by data manifold is actually a loosely defined notion: training set size. data points need not strictly lie on it, but the probability density is expected to fall off sharply as we move away from the “manifold” (which may actually be constituted of several possibly disconnected manifolds with different intrinsic 11. Even if pairs are picked stochastically, many must be considered before dimensionality). obtaining one that weighs significantly on the optimization objective. 16
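The input-space neighborhood graph that these embedding algorithms share is simple to construct; a minimal Python sketch (sizes and the choice of k are arbitrary assumptions) also makes the quadratic cost mentioned above explicit:

import numpy as np

# Pairwise Euclidean distances D_ij = ||x^(i) - x^(j)||_2 and a k-nearest-neighbor
# graph over the training points, as used by neighborhood-based embedding methods.
rng = np.random.default_rng(0)
T, d_x, k = 200, 10, 5
X = rng.standard_normal((T, d_x))              # training points x^(1), ..., x^(T)

# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs at once
sq_norms = np.sum(X ** 2, axis=1)
D2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
D = np.sqrt(np.maximum(D2, 0.0))               # guard against tiny negative values

# k nearest neighbors of each point (column 0 is the point itself, so skip it)
neighbors = np.argsort(D, axis=1)[:, 1:k + 1]  # shape (T, k)

# Note the quadratic scaling in T: both D and the sort touch all T^2 pairs,
# which is the computational issue raised at the end of Section 5.2.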

5.3 Learning a non-linear manifold through a coding taken as an empirical evidence that the CAE indeed modeled scheme the tangent space of a low-dimensional manifold. The CAE We now turn to manifold interpretations of learning techniques criterion is believed to achieve this thanks to its two opposing that are not based on training set neighbor searches. Let us terms: the isotropic contractive penalty, that encourages the begin with PCA, seen as an encoding scheme. In PCA, the representation to be equally insensitive to changes in any input same basis vectors are used to project any input point x. The directions, and the reconstruction term, that pushes different sensitivity of the extracted components (the code) to input training points (in particular neighbors) to have a different changes in the direction of these vectors is the same regardless representation (so they may be reconstructed accurately), thus of position x. The tangent space is the same everywhere along counteracting the isotropic contractive pressure only in direc- the linear manifold. By contrast, for a non-linear manifold, the tions tangent to the manifold. tangent space is expected to change directions as we move. Let us consider sparse-coding in this light: parameter matrix 5.4 Leveraging the modeled tangent spaces W may be interpreted as a dictionary of input directions from The local tangent space, at a point along the manifold, can which a different subset will be picked to model the local be thought of capturing locally valid transformations that tangent space at an x on the manifold. That subset corresponds were prominent in the training data. For example Rifai et al. to the active, i.e. non-zero, features for input x. Note that non- (2011c) examine the tangent directions extracted with an SVD zero component hi will be sensitive to small changes of the of the Jacobian of CAEs trained on digits, images, or text- input in the direction of the associated weight vector W:,i, document data: they appear to correspond to small transla- whereas inactive features are more likely to be stuck at 0 until tions or rotations for images or digits, and to substitutions a significant displacement has taken place in input space. of words within a same theme for documents. Such very The Local Coordinate Coding (LCC) algorithm (Yu et al., local transformations along a data manifold are not expected 2009) is very similar to sparse coding, but is explicitly derived to change class identity. To build their Manifold Tangent from a manifold perspective. Using the same notation as that Classifier (MTC), Rifai et al. (2011c) then apply techniques of sparse-coding in Equation 2, LCC replaces regularization such as tangent distance (Simard et al., 1993) and tangent (t) P (t) propagation (Simard et al., 1992), that were initially developed term kh k1 = j |hj | yielding objective ! to build classifiers that are insensitive to input deformations X (t) (t) 2 X (t) (t) 1+p JLCC = kx − W h k2 + λ |hj |kW:,j − x k provided as prior domain knowledge, only now using the local t j leading tangent directions extracted by a CAE, i.e. not using (39) any prior domain knowledge. This approach set a new record This is identical to sparse-coding when p = −1, but with (t) for MNIST digit classification among prior-knowledge free larger p it encourages the active anchor points for x (i.e. 12 (t) approaches . 
the codebook vectors W:,j with non-negligible |hj | that (t) are combined to reconstruct x ) to be not too far from 6 DEEP ARCHITECTURES (t) x , hence the local aspect of the algorithm. An important Whereas in previous sections we have mostly covered single- theoretical contribution of Yu et al. (2009) is to show that layer feature learning algorithms, we discuss here issues that any Lipschitz-smooth function φ : M → R defined on a arising when trying to train deep architectures, usually by first dx smooth nonlinear manifold M embedded in R , can be well stacking single-layer learners. approximated by a globally linear function with respect to the resulting coding scheme (i.e. linear in h), where the accuracy 6.1 Stacking Single-Layer Models into Deep Models of the approximation and required number dh of anchor points The greedy layerwise unsupervised pre-training proce- depend on dM rather than dx. This result has been further extended with the use of local tangent directions (Yu and dure (Hinton et al., 2006; Bengio et al., 2007; Bengio, 2009) Zhang, 2010), as well as to multiple layers (Lin et al., 2010). is based on training each layer with an unsupervised feature learning algorithm, taking the features produced at the previ- Let us now consider the efficient non-iterative “feed- ous level as input for the next level. It is then straightforward forward” encoders f , used by PSD and the auto-encoders θ to use the resulting deep feature extraction either as input reviewed in Section 3, that are in the form of Equation 30 to a standard supervised machine learning predictor (such as or 37 and 29. The computed representation for x will be only an SVM) or as initialization for a deep supervised neural significantly sensitive to input space directions associated with network (e.g., by appending a layer or non-saturated hidden units (see e.g. Eq. 34 for the Jacobian of purely supervised layers of a multi-layer neural network). It is a sigmoid layer). These directions to which the representation less clear how those layers should be combined to form a better is significantly sensitive, like in the case of PCA or sparse unsupervised model. We cover here some of the approaches coding, may be viewed as spanning the tangent space of the to do so, but no clear winner emerges and much work has to manifold at training point x. be done to validate existing proposals or improve them. Rifai et al. (2011a) empirically analyze in this light the The first proposal was to stack pre-trained RBMs into a singular value spectrum of the Jacobian of a trained CAE. Here (Hinton et al., 2006) or DBN, where the SVD provides an ordered orthonormal basis of most sensi- tive directions. The sharply decreasing spectrum, indicating a 12. It yielded 0.81% error rate using the full MNIST training set, with no relatively small number of significantly sensitive directions, is prior deformations, and no convolution. 17 the top layer is interpreted as an RBM and the lower layers training procedure helping at all?13 as a directed sigmoid belief network. However, it is not clear The latter question was extensively studied (Erhan et al., how to approximate maximum likelihood training to further 2010b), trying to dissect the answer into a regularization effect optimize this generative model. One option is the wake-sleep and an optimization effect. 
The regularization effect is clear algorithm (Hinton et al., 2006) but more work should be done from the experiments where the stacked RBMs or denoising to assess the efficiency of this procedure in terms of improving auto-encoders are used to initialized a supervised classification the generative model. neural network (Erhan et al., 2010b). It may simply come from The second approach that has been put forward is to the use of unsupervised learning to bias the learning dynamics combine the RBM parameters into a Deep Boltzmann Machine and initialize it in the basin of attraction of a “good” local (DBM), by basically halving the RBM weights to obtain minimum (of the training criterion), where “good” is in terms the DBM weights (Salakhutdinov and Hinton, 2009b). The of generalization error. The underlying hypothesis exploited DBM can then be trained by approximate maximum likelihood by this procedure is that some of the features or latent factors as discussed in more details below (Section 6.3). This joint that are good at capturing the leading variations in the input training has yielding substantial improvements, both in terms distribution are also good at capturing the variations in the of likelihood and in terms of classification performance of target output random variables of interest (e.g., classes). The the resulting deep feature learner (Salakhutdinov and Hinton, optimization effect is more difficult to tease out because the 2009b). top two layers of a deep neural net can just overfit the training set whether the lower layers compute useful features or not, but Another early approach was to stack RBMs or auto- there are several indications that optimizing the lower levels encoders into a deep auto-encoder (Hinton and Salakhutdi- with respect to a supervised training criterion is difficult. nov, 2006). If we have a series of encoder-decoder pairs One such indication is that changing the numerical con- (i) (i) (f (·), g (·)), then the overall encoder is the composition of ditions can have a profound impact on the joint training of (N) (2) (1) the encoders, f (. . . f (f (·))), and the overall decoder a deep architecture, for example changing the initialization is its “transpose” (often with transpose weights as well), range and changing the type of non-linearity used (Glorot (1) (2) (N) g (g (. . . f (·))). The deep auto-encoder (or its regu- and Bengio, 2010). One hypothesis to explain some of the larized version, as discussed in Section 3) can then be jointly difficulty in the optimization of deep architectures is centered trained, with all the parameters optimized with respect to a on the singular values of the Jacobian matrix associated common training criterion. More work on this avenue clearly with the transformation from the features at one level into needs to be done, and it was probably avoided by fear of the the features at the next level (Glorot and Bengio, 2010). If difficulties in training deep feedforward networks, discussed these singular values are all small (less than 1), then the in the next section. mapping is contractive in every direction and gradients would Yet another interesting approach recently proposed (Ngiam vanish when propagated backwards through many layers. 
et al., 2011) is to consider the construction of a free energy This is a problem already discussed for recurrent neural function (i.e., with no explicit latent variables, except possibly networks (Bengio et al., 1994), which can be seen as very for a top-level layer of hidden units) for a deep architecture deep networks with shared parameters at each layer, when as the composition of transformations associated with lower unfolded in time. This optimization difficulty has motivated layers, followed by top-level hidden units. The question is then the exploration of second-order methods for deep architectures how to train a model defined by an arbitrary parametrized and recurrent networks, in particular Hessian-free second- (free) energy function. Ngiam et al. (2011) have used Hybrid order methods (Martens, 2010; Martens and Sutskever, 2011). Monte Carlo (Neal, 1993), but another option is score match- Unsupervised pre-training has also been proposed to help ing (Hyvarinen,¨ 2005; Hyvarinen,¨ 2008) or denoising score training recurrent networks and temporal RBMs (Sutskever matching (Kingma and LeCun, 2010; Vincent, 2011). et al., 2009), i.e., at each time step there is a local signal to guide the discovery of good features to capture in the state variables: model with the current state (as hidden units) the joint distribution of the previous state and the current input. Natural gradient (Amari, 1998) methods that can be 6.2 On the Difficulty of Training Deep Feedforward applied to networks with millions of parameters (i.e. with Architectures good scaling properties) have also been proposed (Le Roux et al., 2008b). Cho et al. (2011) proposes to use adaptive One of the most interesting challenges raised by deep ar- learning rates for RBM training, along with a novel and chitectures is: how should we jointly train all the levels? interesting idea for a gradient estimator that takes into account In the previous section we have only discussed how single- the invariance of the model to flipping hidden unit bits and layer models could be combined to form a deep model with 13. Although many papers have shown the advantages of incorporating a joint training criterion. Unfortunately, it appears that it is unsupervised training to guide intermediate layers (Hinton et al., 2006; Weston generally not enough to perform stochastic gradient descent et al., 2008; Larochelle et al., 2009b; Jarrett et al., 2009; Salakhutdinov in deep architectures, and therefore the important sub-question and Hinton, 2009b; Erhan et al., 2010b), some results have also been reported (Ciresan et al., 2010), which appear contradictory, with excellent is why is it difficult to train deep architectures? and this begs results obtained on MNIST without pre-training of the deep network, but the question: why is the greedy layerwise unsupervised pre- using a lot of artificially deformed labeled training examples. 18 inverting signs of corresponding weight vectors. At least one structure in both V and W . We can make the structure of study indicates that the choice of initialization (to make the the DBM more explicit by specifying its energy function. For Jacobian of each layer closer to 1 across all its singular the model with two hidden layers it is given as: T values) could substantially reduce the training difficulty of E DBM(v, h(1), h(2); θ) = − vT W h(1) − h(1) V h(2)− deep networks (Glorot and Bengio, 2010). 
There are also θ (1) T (1) (2) T (2) T several experimental results (Glorot and Bengio, 2010; Glorot d h − d h − b v, (40) et al., 2011a; Nair and Hinton, 2010) showing that the choice with θ = {W, V, d(1), d(2), b}. The DBM can also be char- of hidden units non-linearity could influence both training acterized as a bipartite graph between two sets of vertices, and generalization performance, with particularly interesting formed by the units in odd and even-numbered layers (with results obtained with sparse rectifying units (Jarrett et al., v := h(0)). 2009; Glorot et al., 2011a; Nair and Hinton, 2010). Another promising idea to improve the conditioning of neural network 6.3.1 Mean-field approximate inference training is to nullify the average value and slope of each hidden A key point of departure from the RBM is that the pos- unit output (Raiko et al., 2012), and possibly locally normalize terior distribution over the hidden units (given the visibles) magnitude as well (Jarrett et al., 2009). Finally, the debate is no longer tractable, due to the interactions between the still rages between using online methods such as stochastic hidden units. Salakhutdinov and Hinton (2009a) resort to gradient descent and using second-order methods on large a mean-field approximation to the posterior. Specifically, in minibatches (of several thousand examples) (Martens, 2010; the case of a model with two hidden layers, we wish to Le et al., 2011a), with a variant of stochastic gradient descent approximate P h(1), h(2) | v with the factored distribution 14 Recursive net- N1  (1) N2  (2) recently winning an optimization challenge . Q (h(1), h(2)) = Q Q h Q Q h , such works (Pollack, 1990; Frasconi et al., 1997, 1998; Bottou, v j=1 v j i=1 v i (1) (2)  1 2  2011; Socher et al., 2011) generalize recurrent networks: that the KL divergence KL P h , h | v kQv(h , h ) instead of the unfolded computation being represented by a is minimized or equivalently, that a lower bound to the log chain (with one element per time step), it is represented by likelihood is maximized:  (1) (2)  a directed acyclic graph, where the same parameters are re- X X (1) (2) P (v, h , h ) log P (v) > L(Qv) ≡ Qv(h , h ) log (1) (2) used in different nodes of the graph to define the compu- Qv(h , h ) h(1) h(2) tation performed. There are several interesting connections (41) between some recursive networks and deep learning. First Maximizing this lower-bound with respect to the mean-field 1 2 of all, like recurrent networks, they face a training difficulty distribution Qv(h , h ) yields the following mean field update due to depth. Second, several of them exploit the ideas of equations: ! unsupervised learning and auto-encoders in order to define and (1) X X (2) (1) hˆ ← sigmoid W v + V hˆ + d pre-train the computation performed at each node. Whereas i ji j ik k i (42) j k in a recurrent network one learns for each node (time step) ! to compute a distributed representation of a past sequence, ˆ(2) X ˆ(1) (2) hk ← sigmoid Vikhi + dk (43) in a recursive network one learns to compute for each node a i distributed representation of a subset of observations, typically Iterating Eq. (42-43) until convergence yields the Q parameters represented by a sub-tree. The representation at a node is used to estimate the “variational positive phase” of Eq. 44: obtained by compressing and combining the representations h (1) (2) (1) (2) i at the children of that node in the directed acyclic graph. 
L(Qv) =EQv log P (v, h , h ) − log Qv(h , h ) Furthermore, whereas the pooling operation of convolutional h DBM (1) (2) (1) (2) i =EQv −Eθ (v, h , h ) − log Qv(h , h ) networks (see Section 8.2) combines representations at lower levels in a fixed way, in a recursive network it is possible to let − log Zθ  DBM (1) (2)  the data decide which parts of the input should be combined ∂L(Qv) ∂Eθ (v, h , h ) = −EQv with which part, which can be exploited for example to parse ∂θ ∂θ natural language or scenes (Collobert, 2011; Socher et al.,  ∂E DBM(v, h(1), h(2))  + θ (44) 2011). EP ∂θ Note that this variational learning procedure leaves the “nega- 6.3 Deep Boltzmann Machines tive phase” untouched. It can thus be estimated through SML Like the RBM, the Deep Boltzmann Machine (DBM) is or Contrastive Divergence (Hinton, 2000) as in the RBM case. another particular subset of the Boltzmann machine family of models where the units are again arranged in layers. However 6.3.2 Training Deep Boltzmann Machines unlike the RBM, the DBM possesses multiple layers of hidden Despite the intractability of inference in the DBM, its training units, with units in odd-numbered layers being conditionally should not, in theory, be much more complicated than that independent given even-numbered layers, and vice-versa. With of the RBM. The major difference being that instead of maxi- respect to the Boltzmann energy function of Eq. 8, the DBM mizing the likelihood directly, we instead choose parameters to corresponds to setting U = 0 and a sparse connectivity maximize the lower-bound on the likelihood given in Eq. 41. The SML-based algorithm for maximizing this lower-bound is 14. https://sites.google.com/site/nips2011workshop/optimization-challenges as follows: 19

1) Clamp the visible units to a training example. a small set of values for each hyper-parameter, one defines 2) Iterate over Eq. (42-43) until convergence. a distribution from which to sample values for each hyper- 3) Generate negative phase samples v−, h(1)− and h(2)− parameter, e.g., the log of the learning rate could be taken through SML. as uniform between log(0.1) and log(10−6), or the log of the 4) Compute ∂L(Qv)/∂θ using the values obtained in steps number of hidden units or principal components could be taken 2-3. as uniform between log(2) and log(5000). The main advantage 5) Finally, update the model parameters with a step of of random (or quasi-random) search over a grid is that when approximate stochastic gradient ascent. some hyper-parameters have little or no influence, random While the above procedure appears to be a simple extension search does not waste any computation, whereas grid search of the highly effective SML scheme for training RBMs, as we will redo experiments that are equivalent and do not bring demonstrate in Desjardins et al. (2012), this procedure seems any new information (because many of them have the same vulnerable to falling in poor local minima which leave many value for hyper-parameters that matter and different values for hidden units effectively dead (not significantly different from hyper-parameters that do not). Instead, with random search, its random initialization with small norm). every experiment is different, thus bringing more information. The failure of the SML joint training strategy was noted by In addition, random search is convenient because even if some Salakhutdinov and Hinton (2009a). As a far more successful jobs are not finished, one can draw conclusions from the alternative, they proposed a greedy layer-wise training strategy. jobs that are finished. In fact, one can use the results on This procedure consists in pre-training the layers of the DBM, subsets of the experiments to establish confidence intervals in much the same way as the Deep Belief Network: i.e. by (the experiments are now all iid), and draw a curve (with stacking RBMs and training each layer to independently model confidence intervals) showing how performance improves as the output of the previous layer. A final joint “fine-tuning” is we do more exploration. Of course, it would be even better done following the above SML-based procedure. to perform a sequential optimization (Bergstra et al., 2011) (such as Bayesian Optimization (Brochu et al., 2009)) in 7 PRACTICAL CONSIDERATIONS, order to take advantage of results of training experiments STRENGTHSAND WEAKNESSES as they are obtained and sample in more promising regions One of the criticisms addressed to artificial neural networks of configuration space, but more research needs to be done and deep learning learning algorithms is that they have many towards this. On the other hand, random search is very easy hyper-parameters and variants and that exploring their configu- and does not introduce additional hyper-hyper-parameters (one rations and architectures is an art. This has motivated an earlier still need to define reasonable ranges for the hyper-parameters, book on the “Tricks of the Trade” (Orr and Muller, 1998) of of course). which LeCun et al. (1998a) is still relevant for training deep architectures, in particular what concerns initialization, ill- 7.2 Handling High-Dimensional Sparse Inputs conditioning and stochastic gradient descent. 
A good and more One of the weaknesses of RBMs, auto-encoders and other modern compendium of good training practice, particularly such models that reconstruct or resample inputs during training adapted to training RBMs, is provided in Hinton (2010). is that they do not readily exploit sparsity in the input, and this is particularly important for very high-dimensional 7.1 Hyperparameters sparse input vectors, as often constructed in natural language Because one can have a different type of representation- processing and web applications. In the case of a traditional learning model at each layer, and because each of these feedforward neural network, the computation of the first layer learning algorithms has several hyper-parameters, there is of features can be done efficiently, by considering only the a huge number of possible configurations and choices one non-zeros in the input, and similarly in the back-propagation can make in designing a particular model for a particular phase, only the gradient on the weights associated with a non- dataset. There are two approaches that practitioners of machine zero input needs to be computed. Instead, RBMs, Boltzmann learning typically employ to deal with hyper-parameters. One Machines in general, as well as all the auto-encoder and is manual trial and error, i.e., a human-guided search, which sparse coding variants require some form of reconstruction or is highly dependent on the prior experience and skill of the computation associated with each of the elements of the input expert. The other is a grid search, i.e., choosing a set of values vector whether its value is zero or not. When only dozens for each hyper-parameter and training and evaluating a model of input elements are non-zero out of hundreds of thousands, for each combination of values for all the hyper-parameters. this is a major computational increase over more traditional Both work well when the number of hyper-parameters is small feedforward supervised neural network methods, or over semi- (e.g. 2 or 3) but break down when there are many more: supervised embedding Weston et al. (2008). A solution has experts can handle many hyper-parameters, but results become recently been proposed in the case of auto-encoder variants, even less reproducible and algorithms less accessible to non- using an importance sampling strategy (Dauphin et al., 2011). experts. More systematic approaches are needed. An approach The idea is to compute the reconstruction (and its gradient) for that we have found to scale better is based on random search each example on a small randomly sampled set of input units, and greedy exploration. The idea of random search (Bergstra such that in average (over samples of subsets, and examples) and Bengio, 2012) is simple and can advantageously replace the correct criterion is still minimized. To make the choice of grid search. Instead of forming a regular grid by choosing that subsample efficient (yielding a small variance estimator), 20 the authors propose to always include the non-zeros of the et al., 2011)16 original input, as well as an identical number of uniformly sampled zeros. By appropriately weighing the error (and its 7.4 Evaluating and Monitoring Performance gradient), the expected gradient remains unchanged. It is always possible to evaluate a feature learning algorithm in terms of its usefulness with respect to a particular task (e.g. 
7.3 Advantages for Generalizing to New Classes, object classification), with a predictor that is fed or initialized Tasks and Domains with the learned features. In practice, we do this by saving A particularly appealing use of representation learning is in the the features learned (e.g. at regular intervals during training, case where labels for the task of interest are not available at to perform early stopping) and training a cheap classifier on the time of learning the representation. One may wish to learn top (such as a linear classifier). However, training the final the representation either in a purely unsupervised way, or using classifier can be a substantial computational overhead (e.g., labeled examples for other tasks. This type of setup has been supervised fine-tuning a deep neural network takes usually called self-taught learning (Raina et al., 2007) but also falls in more training iterations than the feature learning itself), so we the areas of transfer learning, domain adaptation, and multi- may want to avoid having to do it for every training iteration task learning (where typically one also has labels for the task and hyper-parameter setting. More importantly this may give of interest) and is related to semi- (where an incomplete evaluation of the features (what would happen one has many unlabeled examples and a few labeled ones). In for other tasks?). All these issues motivate the use of methods an unsupervised representation-learning phase, one may have to monitor and evaluate purely unsupervised performance. This access to examples of only some of the classes or tasks, and is rather easy with all the auto-encoder variants (with some the hypothesis is that the representation learned could be useful caution outlined below) and rather difficult with the undirected for other classes or tasks. One therefore assumes that some of graphical models such as the RBM and Boltzmann machines. the factors that explain the input distribution P (X|Y ) for Y For auto-encoder and sparse coding variants, test set re- in the training classes, and that will be captured by the learned construction error can readily be computed, but by itself may representation, will be useful to model data from other classes be misleading because larger capacity (e.g., more features, or in the context of other tasks. more training time) tends to systematically lead to lower The simplest setting is multi-task learning (Caruana, 1995) reconstruction error, even on the test set. Hence it cannot and it has been shown in many instances (Caruana, 1995; be used reliably for selecting most hyper-parameters. On the Collobert and Weston, 2008; Collobert et al., 2011; Ben- other hand, denoising reconstruction error is clearly immune gio et al., 2011) how a representation can be successfully to this problem, so that solves the problem for denoising shared across several tasks, with the prediction error gradients auto-encoders. It is not clear and remains to be demonstrated associated with each task putting pressure on the shared empirically or theoretically if this may also be true of the representation so that it caters to the demands from all the training criteria for sparse auto-encoders and contractive auto- tasks. This would of course be a wasted effort if there were encoders. According to the JEPADA interpretation (Section 4) no common underlying factors. 
7.3 Advantages for Generalizing to New Classes, Tasks and Domains
A particularly appealing use of representation learning is in the case where labels for the task of interest are not available at the time of learning the representation. One may wish to learn the representation either in a purely unsupervised way, or using labeled examples for other tasks. This type of setup has been called self-taught learning (Raina et al., 2007) but also falls in the areas of transfer learning, domain adaptation, and multi-task learning (where typically one also has labels for the task of interest) and is related to semi-supervised learning (where one has many unlabeled examples and a few labeled ones). In an unsupervised representation-learning phase, one may have access to examples of only some of the classes or tasks, and the hypothesis is that the representation learned could be useful for other classes or tasks. One therefore assumes that some of the factors that explain the input distribution P(X|Y) for Y in the training classes, and that will be captured by the learned representation, will be useful to model data from other classes or in the context of other tasks.
The simplest setting is multi-task learning (Caruana, 1995) and it has been shown in many instances (Caruana, 1995; Collobert and Weston, 2008; Collobert et al., 2011; Bengio et al., 2011) how a representation can be successfully shared across several tasks, with the prediction error gradients associated with each task putting pressure on the shared representation so that it caters to the demands from all the tasks. This would of course be a wasted effort if there were no common underlying factors. Deep learning has also been particularly successful in transfer learning (Bengio, 2011; Mesnil et al., 2011), where examples of the classes found in the test set are not even provided in the representation-learning phase. Another example where this approach is useful is when the test domain is very different from the training domain, or domain adaptation (Daumé III and Marcu, 2006). The target classes are the same in the different domains (e.g. predict user sentiment from their online comment) but expressed in different ways in different domains (e.g., book comments, movie comments, etc.). In this case, the success of unsupervised deep learning (Glorot et al., 2011b; Chen et al., 2012) presumably comes because it can extract higher-level abstractions that are more likely to be applicable across a large number of domains. This will happen naturally when the unsupervised learning takes place on a variety of training domains in the first place, and is particularly interesting when there are domains for which very few labels are available. The success of these approaches has been demonstrated by winning two international machine learning competitions in 2011, the Unsupervised and Transfer Learning Challenge (Mesnil et al., 2011)15 and the Transfer Learning Challenge (Goodfellow et al., 2011)16.
15. Results were presented at ICML 2011's Unsupervised and Transfer Learning Workshop.
16. Results were presented at NIPS 2011's Challenges in Learning Hierarchical Models Workshop.

7.4 Evaluating and Monitoring Performance
It is always possible to evaluate a feature learning algorithm in terms of its usefulness with respect to a particular task (e.g. object classification), with a predictor that is fed or initialized with the learned features. In practice, we do this by saving the features learned (e.g. at regular intervals during training, to perform early stopping) and training a cheap classifier on top (such as a linear classifier). However, training the final classifier can be a substantial computational overhead (e.g., supervised fine-tuning of a deep neural network usually takes more training iterations than the feature learning itself), so we may want to avoid having to do it for every training iteration and hyper-parameter setting. More importantly, this may give an incomplete evaluation of the features (what would happen for other tasks?). All these issues motivate the use of methods to monitor and evaluate purely unsupervised performance. This is rather easy with all the auto-encoder variants (with some caution outlined below) and rather difficult with undirected graphical models such as the RBM and Boltzmann machines.
For auto-encoder and sparse coding variants, test set reconstruction error can readily be computed, but by itself may be misleading because larger capacity (e.g., more features, more training time) tends to systematically lead to lower reconstruction error, even on the test set. Hence it cannot be used reliably for selecting most hyper-parameters. On the other hand, denoising reconstruction error is clearly immune to this problem, so that solves the problem for denoising auto-encoders. It is not clear, and remains to be demonstrated empirically or theoretically, whether this may also be true of the training criteria for sparse auto-encoders and contractive auto-encoders. According to the JEPADA interpretation (Section 4) of these criteria, the training criterion is an unnormalized log-probability, and the normalization constant, although it does not depend on the parameters, does depend on the hyper-parameters (e.g. number of features), so that it is generally not comparable across choices of hyper-parameters that change the form of the training criterion, but it is comparable across choices of hyper-parameters that have to do with optimization (such as the learning rate).
For RBMs and some (not too deep) Boltzmann machines, one option is the use of Annealed Importance Sampling (Murray and Salakhutdinov, 2009) in order to estimate the partition function (and thus the test log-likelihood). Note that the estimator can have high variance (the variance can be estimated too) and that it becomes less reliable as the model becomes more interesting, with larger weights, more non-linearity, sharper modes and a sharper probability density function. Another interesting and recently proposed option for RBMs is to track the partition function during training (Desjardins et al., 2011), which could be useful for early stopping and for reducing the cost of ordinary AIS. For toy RBMs (e.g., 25 hidden units or less, or 25 inputs or less), the exact log-likelihood can also be computed analytically, and this can be a good way to debug and verify some properties of interest.
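As an illustration of the last point, the following sketch computes the exact log-likelihood of a toy binary RBM by brute-force enumeration of all visible configurations. The random parameters are placeholders standing in for a trained model, and the free-energy formula assumes binary visible and hidden units.

import itertools
import numpy as np
from scipy.special import logsumexp

rng = np.random.RandomState(0)
n_visible, n_hidden = 12, 8           # small enough to enumerate all 2^12 visible configs
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
b = rng.normal(scale=0.1, size=n_visible)    # visible biases
c = rng.normal(scale=0.1, size=n_hidden)     # hidden biases

def free_energy(v):
    """F(v) = -b.v - sum_j softplus(c_j + v.W_:,j), so that p(v) = exp(-F(v)) / Z."""
    return -v.dot(b) - np.logaddexp(0.0, v.dot(W) + c).sum(axis=-1)

# Exact log partition function: log Z = logsumexp over all visible configurations of -F(v).
all_v = np.array(list(itertools.product([0, 1], repeat=n_visible)), dtype=float)
log_Z = logsumexp(-free_energy(all_v))

# Exact log-likelihood of some (here random) binary test vectors.
v_test = (rng.uniform(size=(5, n_visible)) < 0.5).astype(float)
log_lik = -free_energy(v_test) - log_Z
print(log_Z, log_lik.mean())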

8 INCORPORATING PRIOR KNOWLEDGE TO LEARN INVARIANT FEATURES
It is well understood that incorporating prior domain knowledge is an almost sure way to boost the performance of any machine-learning-based prediction system. Exploring good strategies for doing so is a very important research avenue. If we are to advance our understanding of core machine learning principles, however, it is important that we keep comparisons between predictors fair and maintain a clear awareness of the amount of prior domain knowledge used by different learning algorithms, especially when comparing their performance on benchmark problems. We have so far tried to present feature learning and deep learning approaches without enlisting specific domain knowledge, only generic inductive biases for high-dimensional problems. The approaches as we presented them should thus be potentially applicable to any high-dimensional problem. In this section, we briefly review how basic domain knowledge can be leveraged to learn yet better features.
The most prevalent approach to incorporating prior knowledge is to hand-design better features to feed a generic classifier; this has been used extensively in computer vision (e.g. Lowe, 1999). Here, we rather focus on how basic domain knowledge of the input, in particular its topological structure (e.g. bitmap images having a 2D structure), may be used to learn better features.

8.1 Augmenting the dataset with known input deformations
Since machine learning approaches learn from data, it is usually possible to improve results by feeding them a larger quantity of representative data. Thus, a straightforward and generic way to leverage prior domain knowledge of input deformations that are known not to change an input's class (e.g. small affine transformations of images such as translations, rotations, scaling, shearing) is to use them to generate more training data. While this is an old approach (Baird, 1990; Poggio and Vetter, 1992), it has recently been applied with great success in the work of Ciresan et al. (2010), who used an efficient GPU implementation (40× speedup) to train a standard but large deep multilayer perceptron on deformed MNIST digits. Using both affine and elastic deformations (Simard et al., 2003), with plain old stochastic gradient descent, they reach a record 0.32% classification error rate. This and the results of Glorot et al. (2011a) were achieved without any pre-training, and thus suggest that deep MLPs can be trained successfully by plain stochastic gradient descent provided they are given sufficiently large layers, appropriate initialization and architecture, and sufficiently large and varied labeled datasets. Clearly, more work is necessary to understand when these techniques succeed at training deep architectures and when they are insufficient.
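A minimal sketch of this kind of dataset augmentation is given below, assuming only small random translations and rotations; the deformation ranges are arbitrary choices, and the elastic deformations of Simard et al. (2003) are omitted.

import numpy as np
from scipy.ndimage import rotate, shift

rng = np.random.RandomState(0)

def random_small_deformation(img, rng, max_angle=10.0, max_shift=2.0):
    """Apply a small random rotation and translation assumed to preserve the label."""
    angle = rng.uniform(-max_angle, max_angle)
    dy, dx = rng.uniform(-max_shift, max_shift, size=2)
    out = rotate(img, angle, reshape=False, order=1, mode="constant", cval=0.0)
    return shift(out, (dy, dx), order=1, mode="constant", cval=0.0)

def augment(images, labels, n_copies, rng):
    """Return the original dataset plus n_copies deformed versions of each image."""
    new_images, new_labels = [images], [labels]
    for _ in range(n_copies):
        new_images.append(np.stack([random_small_deformation(im, rng) for im in images]))
        new_labels.append(labels)
    return np.concatenate(new_images), np.concatenate(new_labels)

# Toy usage with fake 28x28 "digits"; real MNIST images would be used in practice.
images = rng.uniform(size=(16, 28, 28))
labels = rng.randint(0, 10, size=16)
aug_images, aug_labels = augment(images, labels, n_copies=4, rng=rng)
print(aug_images.shape, aug_labels.shape)   # (80, 28, 28) (80,)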
8.2 Convolution and pooling
Another powerful approach is based on even more basic knowledge of merely the topological structure of the input dimensions. By this we mean e.g. the 2D layout of pixels in images or audio spectrograms, the 3D structure of videos, and the 1D sequential structure of text or of temporal sequences in general. Based on such structure, one can define local receptive fields (Hubel and Wiesel, 1959), so that each low-level feature will be computed from only a subset of the input: a neighborhood in the topology (e.g. a subimage at a given position). This topological locality constraint corresponds to a layer having a very sparse weight matrix, with non-zeros only allowed for topologically local connections. Computing the associated matrix products can of course be made much more efficient than having to handle a dense matrix, in addition to the statistical gain from a much smaller number of free parameters. In domains with such topological structure, similar input patterns are likely to appear at different positions, and nearby values (e.g. consecutive frames or nearby pixels) are likely to have stronger dependencies, which are important to model. In fact these dependencies can be exploited to discover the topology (Le Roux et al., 2008a), i.e. recover a regular grid of pixels out of a set of vectors without any order information, e.g. after the elements have been arbitrarily shuffled in the same way for all examples. Thus the same local feature computation is likely to be relevant at all translated positions of the receptive field. Hence the idea of sweeping such a local feature extractor over the topology: this corresponds to a convolution, and transforms an input into a similarly shaped feature map. Equivalently to sweeping, this may be seen as static but differently positioned replicated feature extractors that all share the same parameters. This is at the heart of convolutional networks (LeCun et al., 1989, 1998b).
Another hallmark of the convolutional architecture is that values computed by the same feature detector applied at several neighboring input locations are then summarized through a pooling operation, typically taking their max or their sum. This confers on the resulting pooled feature layer some degree of invariance to input translations, and this style of architecture (alternating selective feature extraction and invariance-creating pooling) has been the basis of convolutional networks, the Neocognitron (Fukushima and Miyake, 1982) and HMAX (Riesenhuber and Poggio, 1999) models, and has been argued to be the architecture used by mammalian brains for object recognition (Riesenhuber and Poggio, 1999; Serre et al., 2007; DiCarlo et al., 2012). The output of a pooling unit will be the same irrespective of where a specific feature is located inside its pooling region. Empirically, the use of pooling seems to contribute significantly to improved classification accuracy in object classification tasks (LeCun et al., 1998b; Boureau et al., 2010, 2011). A successful variant of pooling connected to sparse coding is L2 pooling (Hyvärinen et al., 2009; Kavukcuoglu et al., 2009; Le et al., 2010), for which the pool output is the square root of the (possibly weighted) sum of squares of filter outputs. Ideally, we would like to generalize feature pooling so as to learn what features should be pooled together, e.g. as successfully done in several papers (Hyvärinen and Hoyer, 2000; Kavukcuoglu et al., 2009; Le et al., 2010; Ranzato and Hinton, 2010; Courville et al., 2011b; Coates and Ng, 2011b; Gregor et al., 2011b). In this way, the features learned become invariant to the variations captured by the span of the features pooled.
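The following numpy sketch shows a single convolutional feature map followed by non-overlapping max or L2 pooling; the random kernel stands in for a learned filter, and the image is a toy placeholder.

import numpy as np

def conv2d_valid(image, kernel):
    """Plain 'valid' 2D correlation: slide the kernel over the image (no flipping)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

def pool(feature_map, size=2, mode="max"):
    """Non-overlapping pooling: 'max' keeps the strongest response in each region,
    'l2' returns the square root of the sum of squares (L2 pooling)."""
    H, W = feature_map.shape
    H, W = H - H % size, W - W % size                 # crop to a multiple of `size`
    blocks = feature_map[:H, :W].reshape(H // size, size, W // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "l2":
        return np.sqrt((blocks ** 2).sum(axis=(1, 3)))
    raise ValueError(mode)

rng = np.random.RandomState(0)
image = rng.uniform(size=(28, 28))
kernel = rng.normal(size=(5, 5))                      # one (here random) local feature
feature_map = conv2d_valid(image, kernel)             # 24 x 24 map: same detector everywhere
pooled = pool(feature_map, size=2, mode="max")        # 12 x 12, locally translation-invariant
print(feature_map.shape, pooled.shape)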

LeCun et al. (1989, 1998b) were able to successfully train deep convolutional networks based solely on the gradient derived from a supervised classification objective. This constitutes a rare case of success for really deep neural networks prior to the development of unsupervised pretraining. A recent success of a supervised convolutional network, applied to image segmentation, is Turaga et al. (2010). We now turn to the unsupervised learning strategies that have been leveraged within convolutional architectures.

Patch-based training
The simplest approach for learning a convolutional layer in an unsupervised fashion is patch-based training: simply feeding a generic unsupervised feature learning algorithm with local patches extracted at random positions of the inputs. The resulting feature extractor can then be swiped over the input to produce the convolutional feature maps. That map may be used as a new input, and the operation repeated, to thus learn and stack several layers. Such an approach was recently used with Independent Subspace Analysis (Le et al., 2011c) on 3D video blocks, beating the state-of-the-art on the Hollywood2, UCF, KTH and YouTube action recognition datasets. Similarly, Coates and Ng (2011a) compared several feature learners with patch-based training and reached state-of-the-art results on several classification benchmarks. Interestingly, in this work performance was almost as good with very simple k-means clustering as with more sophisticated feature learners. We however conjecture that this is the case only because patches are rather low dimensional (compared to the dimension of a whole image). A large dataset might provide sufficient coverage of the space of e.g. edges prevalent in 6 × 6 patches, so that a distributed representation is not absolutely necessary. Another plausible explanation for this success is that the clusters identified in each image patch are then pooled into a histogram of cluster counts associated with a larger sub-image. Whereas the output of a regular clustering is a one-hot non-distributed code, this histogram is itself a distributed representation, and the "soft" k-means (Coates and Ng, 2011a) representation allows not only the nearest filter but also its neighbors to be active.
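The following numpy sketch spells out this pipeline under simplifying assumptions: random patches are clustered with a few Lloyd iterations of k-means, and a new image is then encoded by pooling a soft cluster-assignment code (similar in spirit to the "triangle" encoding of Coates and Ng, 2011a) over all of its patches. The toy data, patch size and number of clusters are arbitrary.

import numpy as np

rng = np.random.RandomState(0)

def extract_random_patches(images, n_patches, size, rng):
    """Sample square patches at random positions, flattened to vectors."""
    n, H, W = images.shape
    patches = np.empty((n_patches, size * size))
    for p in range(n_patches):
        i = rng.randint(n)
        r, c = rng.randint(H - size + 1), rng.randint(W - size + 1)
        patches[p] = images[i, r:r + size, c:c + size].ravel()
    return patches

def kmeans(X, k, n_iter, rng):
    """A few Lloyd iterations; enough for a toy illustration."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # squared distances
        assign = d.argmin(1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = X[assign == j].mean(0)
    return centroids

def encode_image(image, centroids, size):
    """Pool a soft cluster-assignment code over all patches of one image."""
    H, W = image.shape
    histogram = np.zeros(len(centroids))
    for r in range(H - size + 1):
        for c in range(W - size + 1):
            patch = image[r:r + size, c:c + size].ravel()
            z = np.sqrt(((patch - centroids) ** 2).sum(1))   # distance to each centroid
            histogram += np.maximum(0.0, z.mean() - z)       # only nearby centroids activate
    return histogram

images = rng.uniform(size=(20, 16, 16))                      # toy "dataset"
patches = extract_random_patches(images, n_patches=2000, size=6, rng=rng)
centroids = kmeans(patches, k=32, n_iter=10, rng=rng)
feature_vector = encode_image(images[0], centroids, size=6)  # pooled, distributed code
print(feature_vector.shape)                                  # (32,)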

Convolutional and tiled-convolutional training
It is also possible to directly train large convolutional layers using an unsupervised criterion. An early approach is that of Jain and Seung (2008) who trained a standard but deep convolutional MLP on the task of denoising images, i.e. as a deep, convolutional, denoising auto-encoder. Convolutional versions of the RBM or its extensions have also been developed (Desjardins and Bengio, 2008; Lee et al., 2009a; Taylor et al., 2010) as well as a probabilistic max-pooling operation built into Convolutional Deep Networks (Lee et al., 2009a,b; Krizhevsky, 2010). Other unsupervised feature learning approaches that were adapted to the convolutional setting include PSD (Kavukcuoglu et al., 2009, 2010; Jarrett et al., 2009; Farabet et al., 2011; Henaff et al., 2011), a convolutional version of sparse coding called deconvolutional networks (Zeiler et al., 2010), Topographic ICA (Le et al., 2010), and mPoT that Kivinen and Williams (2012) applied to modeling natural textures. Le et al. (2010) also introduced the novel technique of tiled-convolution, where parameters are shared only between feature extractors whose receptive fields are k steps away (so the ones looking at immediate neighbor locations are not shared). This allows pooling units to be invariant to more than just translations (Le et al., 2010), but pooling also typically comes with the price of a loss of information (precisely regarding the factors with respect to which the pooling units are insensitive).
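A sketch of the tiled-convolution idea follows, under simplifying assumptions (a 1D signal and random filters): instead of one filter shared at every location, k distinct filters are cycled with period k, so only units whose receptive fields are k steps apart share parameters.

import numpy as np

def tiled_conv1d(signal, filter_bank):
    """1D 'tiled' convolution: position i uses filter i mod k, so weights are
    shared only between units whose receptive fields are k steps apart."""
    k, width = filter_bank.shape
    n_out = len(signal) - width + 1
    out = np.empty(n_out)
    for i in range(n_out):
        out[i] = signal[i:i + width].dot(filter_bank[i % k])
    return out

rng = np.random.RandomState(0)
signal = rng.normal(size=100)
filter_bank = rng.normal(size=(3, 7))        # tile size k=3, receptive field width 7
feature_map = tiled_conv1d(signal, filter_bank)
print(feature_map.shape)                     # (94,)

# With k = 1 this reduces to an ordinary convolution (full weight sharing).
ordinary = tiled_conv1d(signal, filter_bank[:1])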
Alternatives to pooling
Alternatively, one can also use explicit knowledge of the expected invariants expressed mathematically to define transformations that are robust to a known family of input deformations, using so-called scattering operators (Mallat, 2011; Bruna and Mallat, 2011) which can be computed in a way interestingly analogous to deep convolutional networks and wavelets.

8.3 Temporal coherence and slow features
The principle of identifying slowly moving/changing factors in temporal/spatial data has been investigated by many (Becker and Hinton, 1993; Wiskott and Sejnowski, 2002; Hurri and Hyvärinen, 2003; Körding et al., 2004; Cadieu and Olshausen, 2009) as a principle for finding useful representations. In particular this idea has been applied to image sequences and as an explanation for why V1 simple and complex cells behave the way they do. A good overview can be found in Hurri and Hyvärinen (2003); Berkes and Wiskott (2005).
More recently, temporal coherence has been successfully exploited in deep architectures to model video (Mobahi et al., 2009). It was also found that temporal coherence discovered visual features similar to those obtained by ordinary unsupervised feature learning (Bergstra and Bengio, 2009), and a temporal coherence penalty has been combined with a training criterion for unsupervised feature learning (Zou et al., 2011), sparse auto-encoders with L1 regularization, in this case, yielding improved classification performance.
The temporal coherence prior can be expressed in several ways. The simplest and most commonly used is just the squared difference between feature values at times t and t + 1. Other plausible temporal coherence priors include the following. First, instead of penalizing the squared change, penalizing the absolute value (or a similar sparsity penalty) would state that most of the time the change should be exactly 0, which would intuitively make sense for the real-life factors that surround us. Second, one would expect that instead of just being slowly changing, different factors could be associated with their own different time scale. The specificity of their time scale could thus become a hint to disentangle explanatory factors. Third, one would expect that some factors should really be represented by a group of numbers (such as x, y, and z position of some object in space and the pose parameters of Hinton et al. (2011)) rather than by a single scalar, and that these groups tend to move together. Structured sparsity penalties (Kavukcuoglu et al., 2009; Jenatton et al., 2009; Bach et al., 2011; Gregor et al., 2011b) could be used for this purpose.
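The first two of these penalties are easy to state in code. Below is a minimal sketch over an assumed feature trajectory h_1..h_T extracted from a video; the toy trajectory and the penalty weights are placeholders, and in practice one of these terms would be added to an unsupervised feature learning criterion such as a sparse auto-encoder loss.

import numpy as np

def slowness_penalty(H):
    """Squared temporal coherence: sum over t of ||h_{t+1} - h_t||^2."""
    diffs = H[1:] - H[:-1]
    return (diffs ** 2).sum()

def sparse_change_penalty(H):
    """L1 alternative: prefers changes that are exactly zero most of the time."""
    return np.abs(H[1:] - H[:-1]).sum()

rng = np.random.RandomState(0)
T, d = 50, 64
H = rng.normal(size=(T, d)).cumsum(axis=0) * 0.01    # a toy feature trajectory h_1..h_T

total_penalty = 0.1 * slowness_penalty(H)            # or 0.1 * sparse_change_penalty(H)
print(slowness_penalty(H), sparse_change_penalty(H), total_penalty)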

9 CONCLUSION AND FORWARD-LOOKING VISION
This review of unsupervised feature learning and deep learning has covered three major and apparently disconnected approaches: the probabilistic models (both the directed kind such as sparse coding and the undirected kind such as Boltzmann machines), the reconstruction-based algorithms related to auto-encoders, and the geometrically motivated manifold-learning approaches. Whereas several connections had already been made between these in the past, we have deepened these connections and introduced a novel probabilistic framework (JEPADA) to cover them all. Many shared fundamental questions also remain unanswered, some discussed below, starting with what we consider a very central one regarding what makes a good representation, in the next subsection.

9.1 Disentangling Factors of Variation
Complex data arise from the rich interaction of many sources. These factors interact in a complex web that can complicate AI-related tasks such as object classification. For example, an image is composed of the interaction between one or more light sources, the object shapes and the material properties of the various surfaces present in the image. Shadows from objects in the scene can fall on each other in complex patterns, creating the illusion of object boundaries where there are none and dramatically affecting the perceived object shape. How can we cope with such complex interactions, especially those for which we do not have explicit a priori knowledge? How can we learn to disentangle the objects and their shadows?
Ultimately, we believe the approach we adopt for overcoming these challenges must leverage the data itself, using vast quantities of unlabeled examples, to learn representations that separate the various explanatory sources. Doing so should give rise to a representation significantly more robust to the complex and richly structured variations extant in natural data sources for AI-related tasks.
The approach taken recently in the Manifold Tangent Classifier, discussed in section 5.4, is interesting in this respect. Without using any supervision or prior knowledge, it finds prominent local factors of variation (the tangent vectors to the manifold, extracted from a CAE, interpreted as locally valid input "deformations"). Higher-level features are subsequently encouraged to be invariant to these factors of variation, so that they must depend on other characteristics. In a sense this approach is disentangling valid local deformations along the data manifold from other, more drastic changes, associated with other factors of variation such as those that affect class identity.17
17. The changes that affect class identity might, in input space, actually be of similar magnitude to local deformations, but not follow along the manifold, i.e. cross zones of low density.
It is important here to distinguish between the related but distinct goals of learning invariant features and learning to disentangle explanatory factors. The central difference is the preservation of information. Invariant features, by definition, have reduced sensitivity in the direction of invariance. This is the goal of building invariant features and fully desirable if the directions of invariance all reflect sources of variance in the data that are uninformative to the task at hand. However, it is often the case that the goal of feature extraction is the disentangling or separation of many distinct but informative factors in the data, e.g., in a video of people: subject identity, action performed, subject pose relative to the camera, etc. In this situation, the methods of generating invariant features – namely, the feature subspace method – may be inadequate. Roughly speaking, the feature subspace method can be seen as consisting of two steps (often performed together). First, a set of low-level features is recovered that accounts for the data. Second, subsets of these low-level features are pooled together to form higher-level invariant features, similarly to the pooling and subsampling layers of convolutional neural networks (Fukushima, 1980; LeCun et al., 1989, 1998c). With this arrangement, the invariant representation formed by the pooling features offers a somewhat incomplete window on the data, as the detailed representation of the lower-level features is abstracted away in the pooling procedure. While we would like higher-level features to be more abstract and exhibit greater invariance, we have little control over what information is lost through feature subspace pooling. For example, consider a higher-level feature made invariant to the color of its target stimulus by forming a subspace of low-level features that represent the target stimulus in various colors (forming a basis for the subspace). If this is the only higher-level feature that is associated with these low-level colored features, then the color information of the stimulus is lost to the higher-level pooling feature and every layer above. This loss of information becomes a problem when the information that is lost is necessary to successfully complete the task at hand, such as object classification. In the above example, color is often a very discriminative feature in object classification tasks. Losing color information through feature pooling would result in significantly poorer classification performance.
Obviously, what we really would like is for a particular feature set to be invariant to the irrelevant features and disentangle the relevant features. Unfortunately, it is often difficult to determine a priori which set of features will ultimately be relevant to the task at hand.
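To make this point concrete, here is a tiny numerical illustration built entirely on assumed toy quantities rather than anything from the text: two color-specific detectors for the same shape form a feature subspace, and L2 pooling over that subspace yields a color-invariant response from which the color can no longer be recovered.

import numpy as np

# Toy stimuli: the same "shape" rendered in red vs. green, as per-pixel channel vectors.
shape_template = np.array([1.0, 1.0, 0.0])
red_shape   = {"red": shape_template.copy(), "green": np.zeros(3)}
green_shape = {"red": np.zeros(3), "green": shape_template.copy()}

def low_level_features(stimulus):
    """Two color-specific detectors for the same shape (a 2D feature subspace)."""
    return np.array([stimulus["red"].dot(shape_template),
                     stimulus["green"].dot(shape_template)])

def pooled_feature(stimulus):
    """L2 pooling over the subspace: invariant to which color carried the shape."""
    f = low_level_features(stimulus)
    return np.sqrt((f ** 2).sum())

print(low_level_features(red_shape), pooled_feature(red_shape))     # [2. 0.] 2.0
print(low_level_features(green_shape), pooled_feature(green_shape)) # [0. 2.] 2.0
# The pooled value is identical for both stimuli: the shape is preserved,
# but the color information is no longer recoverable from this feature alone.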
An interesting approach to taking advantage of some of the factors of variation known to exist in the data is the transforming auto-encoder (Hinton et al., 2011): instead of a scalar pattern detector (e.g., corresponding to the probability of presence of a particular form in the input) one can think of the features as organized in groups that include both a pattern detector and pose parameters that specify attributes of the detected pattern. In (Hinton et al., 2011), what is assumed a priori is that pairs of inputs (or consecutive inputs) are observed with an associated value for the corresponding change in the pose parameters. In that work, it is also assumed that the pose changes are the same for all the pattern detectors, and this makes sense for global changes such as image translation and camera geometry changes. Instead, we would like to discover the pose parameters and attributes that should be associated with each feature detector, without having to specify ahead of time what they should be, force them to be the same for all features, or necessarily observe the changes in all of the pose parameters or attributes.
Further, as is often the case in the context of deep learning methods (Collobert and Weston, 2008), the feature set being trained may be destined to be used in multiple tasks that may have distinct subsets of relevant features. Considerations such as these lead us to the conclusion that the most robust approach to feature learning is to disentangle as many factors as possible, discarding as little information about the data as is practical. If some form of dimensionality reduction is desirable, then we hypothesize that the local directions of variation least represented in the training data should be the first to be pruned out (as in PCA, for example, which does it globally instead of around each example).
One solution to the problem of information loss that would fit within the feature subspace paradigm is to consider many overlapping pools of features based on the same low-level feature set. Such a structure would have the potential to learn a redundant set of invariant features that may not cause significant loss of information. However, it is not obvious what learning principle could be applied that can ensure that the features are invariant while maintaining as much information as possible. While a Deep Belief Network or a Deep Boltzmann Machine (as discussed in sections 6.1 and 6.3 respectively) with two hidden layers would, in principle, be able to preserve information into the "pooling" second hidden layer, there is no guarantee that the second layer features are more invariant than the "low-level" first layer features. However, there is some empirical evidence that the second layer of the DBN tends to display more invariance than the first layer (Erhan et al., 2010a).
A second issue with this approach is that it could nullify one of the major motivations for pooling features: to reduce the size of the representation. A pooling arrangement with a large number of overlapping pools could lead to as many pooling features as low-level features – a situation that is both computationally and statistically undesirable.
A more principled approach, from the perspective of ensuring a more robust compact feature representation, can be conceived by reconsidering the disentangling of features through the lens of its generative equivalent – feature composition. Since many unsupervised learning algorithms have a generative interpretation (or a way to reconstruct inputs from their high-level representation), the generative perspective can provide insight into how to think about disentangling factors. The majority of the models currently used to construct invariant features have the interpretation that their low-level features linearly combine to construct the data.18 This is a fairly rudimentary form of feature composition with significant limitations. For example, it is not possible to linearly combine a feature with a generic transformation (such as translation) to generate a transformed version of the feature. Nor can we even consider a generic color feature being linearly combined with a gray-scale stimulus pattern to generate a colored pattern. It would seem that if we are to take the notion of disentangling seriously, we require a richer interaction of features than that offered by simple linear combinations.
18. As an aside, if we are given only the values of the higher-level pooling features, we cannot accurately recover the data because we do not know how to apportion credit for the pooling feature values to the lower-level features. This is simply the generative version of the consequences of the loss of information caused by pooling.
9.2 Open Questions
We close this paper by returning briefly to the questions raised at the beginning, which remain largely open.
What is a good representation? Much more work is needed here. Our insight is that it should disentangle the factors of variation. Different factors tend to change independently of each other in the input distribution, and only a few at a time tend to change when one considers a sequence of consecutive real-world inputs. More abstract representations detect categories that cover more varied phenomena (larger manifolds with more wrinkles) and they have a greater predictive power. More abstract concepts can often be constructed in terms of less abstract ones, hence the prior that deep representations can be very useful. Another part of the answer could be that sampling in the more abstract space associated with a good learned representation should be easier than in the raw data space, less prone to the mixing difficulties of Markov chain Monte-Carlo methods run in input space (due to the presence of many probability modes separated by low-density regions). This is because although the surface form (e.g. pixel space) representation of two classes could be very different, they might be close to each other in an abstract space (e.g. keep an object's shape the same but change its colour).
What are good criteria for learning one? This is probably the most important practical question, and it is wide open. With the JEPADA framework, a criterion is essentially equivalent to a probability model along with the objective of maximizing likelihood over data and parameters. Although feature learning algorithms such as RBMs, sparse auto-encoders, denoising auto-encoders or contractive auto-encoders appear to generate more invariance and more disentangling as we go deeper, there is certainly room to make the training criteria more explicitly aimed at this disentangling objective.
Are there probabilistic interpretations of non-probabilistic feature learning algorithms such as auto-encoder variants and PSD, and could we sample from the corresponding models? In this review we have covered both probabilistic feature-learning algorithms and non-probabilistic ones. The JEPADA framework may be one way to interpret non-probabilistic ones such as auto-encoder variants and PSD as probabilistic models, and this would in principle provide a way to sample from them (e.g. using Hybrid Monte-Carlo (Neal, 1993)), but this may not be the best way to do so. Another promising direction is to exploit the geometric interpretation of the representations learned by algorithms such as the Contractive Auto-Encoder (Sections 3.4 and 5.3). Since these algorithms characterize the manifold structure of the density, there might be enough information there to basically walk near these high-density regions corresponding to good samples, staying near these manifolds. As soon as one has a sampling procedure, this also implicitly defines a probability model, of course.

What are the advantages and disadvantages of the probabilistic vs non-probabilistic feature learning algorithms? Two practical difficulties with many probabilistic models are the problems of inference and of computing the partition function (and its gradient), two problems which seem absent from several non-probabilistic models. Explaining away is probably a good thing to have (Coates and Ng, 2011a), present in sparse coding and PSD but lacking from RBMs, but it could be emulated in deep Boltzmann machines, as well as in non-probabilistic models through lateral (typically inhibitory) interactions between the features h_i of the same layer (Larochelle et al., 2009a). A major appeal of probabilistic models is that the semantics of the latent variables are clear and this allows a clean separation of the problems of modeling (choose the energy function), inference (estimating P(h|x)), and learning (optimizing the parameters), using generic tools in each case. On the other hand, doing approximate inference and not taking that into account explicitly in the approximate optimization for learning could have detrimental effects. One solution could be to incorporate the approximate inference as part of the training criterion (like in PSD), and the JEPADA framework allows that. More fundamentally, there is the question of the multimodality of the posterior P(h|x). If there are exponentially many probable configurations of values of the factors h_i that can explain x, then we seem to be stuck with very poor inference, either assuming some kind of strong factorization (as in variational inference) or using an MCMC that cannot visit enough modes of P(h|x) to possibly do a good job of inference. It may be that the solution to this problem could be in dropping the requirement of an explicit representation of the posterior and settling for an implicit representation that exploits potential structure in P(h|x) in order to represent it compactly. For example, consider computing a deterministic feature representation f(x) (not to be confused with, e.g., E[h|x]) that implicitly captures the information about P(h|x), in the sense that all the questions (e.g. making some prediction about some target concept) that can be asked (and be answered easily) from P(h|x) can also be answered easily from f(x).
How can we evaluate the quality of a representation-learning algorithm? This is a very important practical question. Current practice is to "test" the learned features in one or more prediction tasks, when computing log-likelihood is not possible (which is almost always the case).
Should learned representations be necessarily low-dimensional? Certainly not in the absolute sense, although we would like the learned representation to provide low-dimensional local coordinates for a given input. We can have a very high-dimensional representation but where only very few dimensions can actually vary as we move in input space. Those dimensions that vary will differ in different regions of input space. Many of the non-varying dimensions can be seen as "not-applicable" attributes. This freedom means we do not have to entangle many factors into a small set of numbers.
Should we map inputs to representations in a way that takes into account the explaining away effect of different explanatory factors, at the price of more expensive computation? We believe the answer to be yes, and we can see this tension between algorithms such as sparse coding on one hand and algorithms such as auto-encoders and RBMs on the other hand. An interesting compromise is to incorporate approximate inference as part of the model itself, and to learn how to make good approximate inference, as is done in PSD (Section 3.5) and to some extent in Salakhutdinov and Larochelle (2010).
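To illustrate the computational tension just mentioned, here is a small numpy sketch, with an assumed random dictionary and random encoder weights, contrasting iterative sparse coding inference (ISTA, where the D^T D coupling between features implements explaining away) with a single cheap feedforward pass of the kind a PSD-style encoder would be trained to produce.

import numpy as np

rng = np.random.RandomState(0)
d, k = 64, 128                                   # input dimension, number of features
D = rng.normal(size=(d, k))
D /= np.linalg.norm(D, axis=0)                   # unit-norm dictionary columns
x = rng.normal(size=d)
lam = 0.5

def ista(x, D, lam, n_iter=100):
    """Iterative shrinkage-thresholding for min_z 0.5*||x - Dz||^2 + lam*||z||_1.
    The D^T D coupling between features is what implements explaining away."""
    L = np.linalg.norm(D, 2) ** 2                # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T.dot(D.dot(z) - x)
        u = z - grad / L
        z = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)   # soft threshold
    return z

z_iterative = ista(x, D, lam)                    # expensive, iterative inference

# A PSD-style fast encoder would be trained to predict z_iterative in one pass;
# here the weights are random placeholders, shown only for the computational contrast.
W_enc, b_enc = rng.normal(scale=0.1, size=(k, d)), np.zeros(k)
z_feedforward = np.tanh(W_enc.dot(x) + b_enc)    # one cheap pass, no explaining away

print(np.count_nonzero(z_iterative), z_feedforward.shape)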
Is the added power of stacking representations into a deep architecture worth the extra effort? We believe the answer to be yes, although of course depth is going to be more valuable when (1) the task demands it and (2) our learning algorithms for deep architectures improve (see next question).
What are the reasons why globally optimizing a deep architecture has been found difficult? We believe the main reason is an optimization difficulty, although this can be a confusing concept because top layers can overfit and get the training error to 0 when the training set is small. To make things clearer, consider the online training scenario where the learner sees a continuous stream of training examples and makes stochastic gradient steps after each example. The learner is really optimizing the generalization error, since the gradient on the next example is an unbiased Monte-Carlo estimator of the gradient of the generalization error. In this scenario, optimization and learning are the same, and we optimize what we should (assuming the optimization algorithm takes into account the uncertainty associated with the fact that we get a random sample of the gradient). Depending on the choice of optimization algorithm, the learner may appear to be stuck in what we call an effective local minimum. We need to understand this better, and explore new ways to handle these effective local minima, possibly going to global optimization methods such as evolutionary algorithms and Bayesian optimization.
What are the reasons behind the success of some of the methods that have been proposed to learn representations, and in particular deep ones? Some interesting answers have already been given in previous work (Bengio, 2009; Erhan et al., 2010b), but deep learning and representation learning research is clearly in an exploration mode which could gain from more theoretical and conceptual understanding.

Acknowledgements
The authors would like to thank David Warde-Farley and Razvan Pascanu for useful feedback, as well as NSERC, CIFAR and the Canada Research Chairs for funding.

REFERENCES
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.
Arel, I., Rose, D., and Karnowski, T. (2010). Deep machine learning - a new frontier in artificial intelligence research. Comp. Intelligence Magazine, 5(4), 13–18.
Bach, F., Jenatton, R., Mairal, J., and Obozinski, G. (2011). Structured sparsity through convex optimization. CoRR, abs/1109.2397.
Bagnell, J. A. and Bradley, D. M. (2009). Differentiable sparse coding. In NIPS'2009, pages 113–120.
Baird, H. (1990). Document image defect models. In IAPR Workshop, Syntactic & Structural Patt. Rec., pages 38–46.
Becker, S. and Hinton, G. E. (1993). Learning mixture models of spatial coherence. Neural Computation, 5, 267–277.

Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for Vision, 5(6), 579–602. dimensionality reduction and data representation. Neural Besag, J. (1975). Statistical analysis of non-lattice data. The Computation, 15(6), 1373–1396. Statistician, 24(3), 179–195. Bell, A. and Sejnowski, T. J. (1997). The independent com- Bottou, L. (2011). From machine learning to machine reason- ponents of natural scenes are edge filters. Vision Research, ing. Technical report, arXiv.1102.1808. 37, 3327–3338. Boureau, Y., Ponce, J., and LeCun, Y. (2010). A theoret- Bengio, Y. (2009). Learning deep architectures for AI. ical analysis of feature pooling in vision algorithms. In Foundations and Trends in Machine Learning, 2(1), 1–127. ICML’10. Also published as a book. Now Publishers, 2009. Boureau, Y., Le Roux, N., Bach, F., Ponce, J., and LeCun, Y. Bengio, Y. (2011). Deep learning of representations for (2011). Ask the locals: multi-way local pooling for image unsupervised and transfer learning. In JMLR W&CP: Proc. recognition. In ICCV’11. Unsupervised and Transfer Learning. Bourlard, H. and Kamp, Y. (1988). Auto-association by Bengio, Y. and Delalleau, O. (2009). Justifying and gener- multilayer perceptrons and singular value decomposition. alizing contrastive divergence. Neural Computation, 21(6), Biological Cybernetics, 59, 291–294. 1601–1621. Brand, M. (2003). Charting a manifold. In NIPS’2002, pages Bengio, Y. and Delalleau, O. (2011). On the expressive power 961–968. MIT Press. of deep architectures. In ALT’2011. Breuleux, O., Bengio, Y., and Vincent, P. (2011). Quickly gen- Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms erating representative samples from an rbm-derived process. towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and Neural Computation, 23(8), 2053–2073. J. Weston, editors, Large Scale Kernel Machines. MIT Press. Brochu, E., Cora, V. M., and de Freitas, N. (2009). A tutorial Bengio, Y. and Monperrus, M. (2005). Non-local manifold on Bayesian optimization of expensive cost functions, with tangent learning. In NIPS’2004, pages 129–136. MIT Press. application to active user modeling and hierarchical rein- Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long- forcement learning. Technical Report TR-2009-23, Depart- term dependencies with gradient descent is difficult. IEEE ment of Computer Science, University of British Columbia. Transactions on Neural Networks, 5(2), 157–166. Bruna, J. and Mallat, S. (2011). Classification with scattering Bengio, Y., Paiement, J.-F., Vincent, P., Delalleau, O., Le operators. In ICPR’2011. Roux, N., and Ouimet, M. (2004). Out-of-sample extensions Cadieu, C. and Olshausen, B. (2009). Learning transforma- for LLE, Isomap, MDS, Eigenmaps, and Spectral Cluster- tional invariants from natural movies. In NIPS’2009, pages ing. In NIPS’2003. 209–216. MIT Press. Bengio, Y., Delalleau, O., and Le Roux, N. (2006a). The curse Carreira-Perpinan,˜ M. A. and Hinton, G. E. (2005). On of highly variable functions for local kernel machines. In contrastive divergence learning. In AISTATS’2005, pages NIPS 18, pages 107–114. MIT Press, Cambridge, MA. 33–40. Bengio, Y., Larochelle, H., and Vincent, P. (2006b). Non-local Caruana, R. (1995). Learning many related tasks at the same manifold Parzen windows. In NIPS’2005, pages 115–122. time with . In NIPS’94, pages 657–664. MIT Press. Cayton, L. (2005). Algorithms for manifold learning. Techni- Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. cal Report CS2008-0923, UCSD. (2007). 
Greedy layer-wise training of deep networks. In Chen, M., Xu, Z., Weinberger, K. Q., and Sha, F. (2012). NIPS’2006. Marginalized stacked denoising . Learning Bengio, Y., Delalleau, O., and Simard, C. (2010). Decision Workshop’2012. trees do not generalize to new variations. Computational Cho, K., Raiko, T., and Ilin, A. (2010). Parallel tempering Intelligence, 26(4), 449–467. is efficient for learning restricted boltzmann machines. In Bengio, Y., Bastien, F., Bergeron, A., Boulanger- IJCNN’2010. Lewandowski, N., Breuel, T., Chherawala, Y., Cisse, Cho, K., Raiko, T., and Ilin, A. (2011). Enhanced gradient M., Cotˆ e,´ M., Erhan, D., Eustache, J., Glorot, X., Muller, and adaptive learning rate for training restricted boltzmann X., Pannetier Lebeuf, S., Pascanu, R., Rifai, S., Savard, machines. In ICML’2011, pages 105–112. F., and Sicard, G. (2011). Deep learners benefit more Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhu- from out-of-distribution examples. In JMLR W&CP: Proc. ber, J. (2010). Deep big simple neural nets for handwritten AISTATS’2011. digit recognition. Neural Computation, 22, 1–14. Bergstra, J. and Bengio, Y. (2009). Slow, decorrelated features Coates, A. and Ng, A. Y. (2011a). The importance of encoding for pretraining complex cell-like networks. In NIPS’2009, versus training with sparse coding and . pages 99–107. In ICML’2011. Bergstra, J. and Bengio, Y. (2012). Random search for Coates, A. and Ng, A. Y. (2011b). Selecting receptive fields hyper-parameter optimization. Journal of Machine Learning in deep networks. In NIPS’2011. Research, 13, 281–305. Collobert, R. (2011). Deep learning for efficient discriminative Bergstra, J., Bardenet, R., Bengio, Y., and Kegl,´ B. (2011). Al- parsing. In AISTATS’2011. gorithms for hyper-parameter optimization. In NIPS’2011. Collobert, R. and Weston, J. (2008). A unified architecture Berkes, P. and Wiskott, L. (2005). Slow feature analysis yields for natural language processing: Deep neural networks with a rich repertoire of complex cell properties. Journal of multitask learning. In ICML’2008. 27

Collobert, R., Weston, J., Bottou, L., Karlen, M., Transactions on Neural Networks, 9(5), 768–786. Kavukcuoglu, K., and Kuksa, P. (2011). Natural language Fukushima, K. (1980). Neocognitron: A self-organizing neural processing (almost) from scratch. Journal of Machine network model for a mechanism of Learning Research, 12, 2493–2537. unaffected by shift in position. Biological Cybernetics, 36, Comon, P. (1994). Independent component analysis - a new 193–202. concept? Signal Processing, 36, 287–314. Fukushima, K. and Miyake, S. (1982). Neocognitron: A new Courville, A., Bergstra, J., and Bengio, Y. (2011a). A spike algorithm for pattern recognition tolerant of deformations and slab restricted boltzmann machine. In AISTATS’2011. and shifts in position. Pattern Recognition, 15, 455–469. Courville, A., Bergstra, J., and Bengio, Y. (2011b). Unsu- Garrigues, P. and Olshausen, B. (2008). Learning horizontal pervised models of images by spike-and-slab RBMs. In connections in a sparse coding model of natural images. In ICML’2011. NIPS’20. Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. Glorot, X. and Bengio, Y. (2010). Understanding the dif- (2010). Phone recognition with the mean-covariance re- ficulty of training deep feedforward neural networks. In stricted boltzmann machine. In NIPS’2010. AISTATS’2010, volume 9, pages 249–256. Daume´ III, H. and Marcu, D. (2006). Domain adaptation for Glorot, X., Bordes, A., and Bengio, Y. (2011a). Deep sparse statistical classifiers. JAIR, 26, 101–126. rectifier neural networks. In AISTATS’2011. Dauphin, Y., Glorot, X., and Bengio, Y. (2011). Large-scale Glorot, X., Bordes, A., and Bengio, Y. (2011b). Domain learning of embeddings with reconstruction sampling. In adaptation for large-scale sentiment classification: A deep ICML’2011. learning approach. In ICML’2011. Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring Maximum-likelihood from incomplete data via the EM invariances in deep networks. In NIPS’2009, pages 646– algorithm. J. Royal Statistical Society B, 39, 1–38. 654. Desjardins, G. and Bengio, Y. (2008). Empirical evaluation Goodfellow, I., Courville, A., and Bengio, Y. (2011). Spike- of convolutional RBMs for vision. Technical Report 1327, and-slab sparse coding for unsupervised feature discovery. Dept. IRO, U. Montreal.´ In NIPS Workshop on Challenges in Learning Hierarchical Desjardins, G., Courville, A., Bengio, Y., Vincent, P., and Models. Delalleau, O. (2010). Tempered Markov chain monte carlo Goodfellow, I. J., Courville, A., and Bengio, Y. (2012). Spike- for training of restricted Boltzmann machine. In JMLR and-slab sparse coding for unsupervised feature discovery. W&CP: Proc. AISTATS’2010, volume 9, pages 145–152. arXiv:1201.3382. Desjardins, G., Courville, A., and Bengio, Y. (2011). On Gregor, K., Szlam, A., and LeCun, Y. (2011a). Structured tracking the partition function. In NIPS’2011. sparse coding via lateral inhibition. In NIPS’2011. Desjardins, G., Courville, A., and Bengio, Y. (2012). On Gregor, K., Szlam, A., and LeCun, Y. (2011b). Structured training deep Boltzmann machines. Technical Report sparse coding via lateral inhibition. In Advances in Neural arXiv:1203.4416v1, Universite´ de Montreal.´ Information Processing Systems (NIPS 2011), volume 24. DiCarlo, J., Zoccolan, D., and Rust, N. (2012). How does the Gribonval, R. (2011). Should penalized regres- brain solve visual object recognition? Neuron. sion be interpreted as Maximum A Posteriori estimation? 
Donoho, D. L. and Grimes, C. (2003). Hessian eigen- IEEE Transactions on Signal Processing, 59(5), 2405–2410. maps: new locally linear embedding techniques for high- Grosse, R., Raina, R., Kwong, H., and Ng, A. Y. (2007). dimensional data. Technical Report 2003-08, Dept. Statis- Shift-invariant sparse coding for audio classification. In tics, Stanford University. UAI’2007. Erhan, D., Courville, A., and Bengio, Y. (2010a). Understand- Hastad,˚ J. (1986). Almost optimal lower bounds for small ing representations learned in deep architectures. Technical depth circuits. In STOC’86, pages 6–20. Report 1355, Universite´ de Montreal/DIRO.´ Hastad,˚ J. and Goldmann, M. (1991). On the power of small- Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, depth threshold circuits. Computational Complexity, 1, 113– P., and Bengio, S. (2010b). Why does unsupervised pre- 129. training help deep learning? Journal of Machine Learning Henaff, M., Jarrett, K., Kavukcuoglu, K., and LeCun, Y. Research, 11, 625–660. (2011). Unsupervised learning of sparse features for scal- Farabet, C., LeCun, Y., Kavukcuoglu, K., Culurciello, E., able audio classification. In ISMIR’11. Martini, B., Akselrod, P., and Talay, S. (2011). Large- Hinton, G., Krizhevsky, A., and Wang, S. (2011). Transform- scale fpga-based convolutional networks. In R. Bekkerman, ing auto-encoders. In ICANN’2011. M. Bilenko, and J. Langford, editors, Scaling up Machine Hinton, G. E. (1999). Products of experts. In Proceedings of Learning: Parallel and Distributed Approaches. Cambridge the Ninth International Conference on Artificial Neural Net- University Press. works (ICANN), volume 1, pages 1–6, Edinburgh, Scotland. Frasconi, P., Gori, M., and Sperduti, A. (1997). On the efficient IEE. classification of data structures by neural networks. In Proc. Hinton, G. E. (2000). Training products of experts by Int. Joint Conf. on Artificial Intelligence. minimizing contrastive divergence. Technical Report GCNU Frasconi, P., Gori, M., and Sperduti, A. (1998). A general TR 2000-004, Gatsby Unit, University College London. framework for adaptive processing of data structures. IEEE Hinton, G. E. (2010). A practical guide to training re- 28

stricted Boltzmann machines. Technical Report UTML TR Mathieu, M., and LeCun, Y. (2010). Learning convolutional 2010-003, Department of Computer Science, University of feature hierarchies for visual recognition. In NIPS’2010. Toronto. Kingma, D. and LeCun, Y. (2010). Regularized estimation of Hinton, G. E. and Roweis, S. (2003). Stochastic neighbor image statistics by score matching. In NIPS’2010. embedding. In NIPS’2002. Kivinen, J. J. and Williams, C. K. I. (2012). Multiple texture Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the boltzmann machines. In AISTATS’2012. dimensionality of data with neural networks. Science, Kording,¨ K. P., Kayser, C., Einhauser,¨ W., and Konig,¨ P. 313(5786), 504–507. (2004). How are complex cell properties adapted to the Hinton, G. E. and Zemel, R. S. (1994). Autoencoders, statistics of natural stimuli? J. Neurophysiology, 91. minimum description length, and helmholtz free energy. In Krizhevsky, A. (2010). Convolutional deep belief networks NIPS’1993. on cifar-10. Technical report, University of Toronto. Un- Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning published Manuscript: http://www.cs.utoronto.ca/ kriz/conv- algorithm for deep belief nets. Neural Computation, 18, cifar10-aug2010.pdf. 1527–1554. Krizhevsky, A. and Hinton, G. (2009). Learning multiple Hotelling, H. (1933). Analysis of a complex of statistical layers of features from tiny images. U. Toronto. variables into principal components. Journal of Educational Larochelle, H. and Bengio, Y. (2008). Classification using dis- Psychology, 24, 417–441, 498–520. criminative restricted Boltzmann machines. In ICML’2008. Hubel, D. H. and Wiesel, T. N. (1959). Receptive fields Larochelle, H., Erhan, D., and Vincent, P. (2009a). Deep learn- of single neurons in the cat’s striate cortex. Journal of ing using robust interdependent codes. In AISTATS’2009. Physiology, 148, 574–591. Larochelle, H., Bengio, Y., Louradour, J., and Lamblin, P. Hurri, J. and Hyvarinen,¨ A. (2003). Temporal coherence, nat- (2009b). Exploring strategies for training deep neural ural image sequences, and the visual cortex. In NIPS’2002. networks. Journal of Machine Learning Research, 10, 1–40. Hyvarinen,¨ A. (2005). Estimation of non-normalized statistical Lazebnik, S., Schmid, C., and Ponce, J. (2006). Beyond models using score matching. J. Machine Learning Res., 6. bags of features: Spatial pyramid matching for recognizing Hyvarinen,¨ A. (2007). Some extensions of score matching. natural scene categories. In CVPR’2006. Computational Statistics and Data Analysis, 51, 2499–2512. Le, Q., Ngiam, J., Chen, Z., hao Chia, D. J., Koh, P. W., and Hyvarinen,¨ A. (2008). Optimal approximation of signal priors. Ng, A. (2010). Tiled convolutional neural networks. In Neural Computation, 20(12), 3087–3110. NIPS’2010. Hyvarinen,¨ A. and Hoyer, P. (2000). Emergence of phase and Le, Q., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and shift invariant features by decomposition of natural images Ng, A. (2011a). On optimization methods for deep learning. into independent feature subspaces. Neural Computation, In ICML’2011. 12(7), 1705–1720. Le, Q. V., Karpenko, A., Ngiam, J., and Ng, A. Y. (2011b). Hyvarinen,¨ A., Karhunen, J., and Oja, E. (2001a). Independent ICA with reconstruction cost for efficient overcomplete Component Analysis. Wiley-Interscience. feature learning. In NIPS’2011. Hyvarinen,¨ A., Hoyer, P. O., and Inki, M. (2001b). Topo- Le, Q. V., Zou, W. Y., Yeung, S. Y., and Ng, A. Y. 
graphic independent component analysis. Neural Computa- (2011c). Learning hierarchical spatio-temporal features for tion, 13(7), 1527–1558. action recognition with independent subspace analysis. In Hyvarinen,¨ A., Hurri, J., and Hoyer, P. O. (2009). Natural CVPR’2011. Image Statistics: A probabilistic approach to early compu- Le Roux, N., Bengio, Y., Lamblin, P., Joliveau, M., and Kegl, tational vision. Springer-Verlag. B. (2008a). Learning the 2-D topology of images. In Jain, V. and Seung, S. H. (2008). Natural image denoising NIPS’07. with convolutional networks. In NIPS’2008. Le Roux, N., Manzagol, P.-A., and Bengio, Y. (2008b). Top- Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. moumoute online natural gradient algorithm. In NIPS’07. (2009). What is the best multi-stage architecture for object LeCun, Y. (1987). Modeles` connexionistes de l’apprentissage. recognition? In ICCV’09. Ph.D. thesis, Universite´ de Paris VI. Jenatton, R., Audibert, J.-Y., and Bach, F. (2009). Structured LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, variable selection with sparsity-inducing norms. Technical R. E., Hubbard, W., and Jackel, L. D. (1989). Backpropa- report, arXiv:0904.3523. gation applied to handwritten zip code recognition. Neural Jutten, C. and Herault, J. (1991). Blind separation of sources, Computation, 1(4), 541–551. part I: an adaptive algorithm based on neuromimetic archi- LeCun, Y., Bottou, L., Orr, G. B., and Muller,¨ K.-R. (1998a). tecture. Signal Processing, 24, 1–10. Efficient backprop. In Neural Networks, Tricks of the Trade, Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008). Fast Lecture Notes in Computer Science LNCS 1524. Springer inference in sparse coding algorithms with applications to Verlag. object recognition. CBLL-TR-2008-12-01, NYU. LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998b). Kavukcuoglu, K., Ranzato, M.-A., Fergus, R., and LeCun, Gradient-based learning applied to document recognition. Y. (2009). Learning invariant features through topographic Proceedings of the IEEE, 86(11), 2278–2324. filter maps. In CVPR’2009. LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998c). Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Gradient based learning applied to document recognition. 29
