An Analysis of Single-Layer Networks in Unsupervised Feature Learning

Adam Coates (Stanford University, Computer Science Dept., 353 Serra Mall, Stanford, CA 94305)
Honglak Lee (University of Michigan, Computer Science and Engineering, 2260 Hayward Street, Ann Arbor, MI 48109)
Andrew Y. Ng (Stanford University, Computer Science Dept., 353 Serra Mall, Stanford, CA 94305)

Abstract

A great deal of research has focused on algorithms for learning features from unlabeled data. Indeed, much progress has been made on benchmark datasets like NORB and CIFAR by employing increasingly complex unsupervised learning algorithms and deep models. In this paper, however, we show that several simple factors, such as the number of hidden nodes in the model, may be more important to achieving high performance than the learning algorithm or the depth of the model. Specifically, we will apply several off-the-shelf feature learning algorithms (sparse auto-encoders, sparse RBMs, K-means clustering, and Gaussian mixtures) to CIFAR, NORB, and STL datasets using only single-layer networks. We then present a detailed analysis of the effect of changes in the model setup: the receptive field size, number of hidden nodes (features), the step-size (“stride”) between extracted features, and the effect of whitening. Our results show that large numbers of hidden nodes and dense feature extraction are critical to achieving high performance—so critical, in fact, that when these parameters are pushed to their limits, we achieve state-of-the-art performance on both CIFAR-10 and NORB using only a single layer of features. More surprisingly, our best performance is based on K-means clustering, which is extremely fast, has no hyper-parameters to tune beyond the model structure itself, and is very easy to implement. Despite the simplicity of our system, we achieve accuracy beyond all previously published results on the CIFAR-10 and NORB datasets (79.6% and 97.2% respectively).

1 Introduction

Much recent work in machine learning has focused on learning good feature representations from unlabeled input data for higher-level tasks such as classification. Current solutions typically learn multi-level representations by greedily “pre-training” several layers of features, one layer at a time, using an unsupervised learning algorithm [11, 8, 18]. For each of these layers a number of design parameters are chosen: the number of features to learn, the locations where these features will be computed, and how to encode the inputs and outputs of the system. In this paper we study the effect of these choices on single-layer networks trained by several feature learning methods. Our results demonstrate that several key ingredients, orthogonal to the learning algorithm itself, can have a large impact on performance: whitening, large numbers of features, and dense feature extraction can all be major advantages. Even with very simple algorithms and a single layer of features, it is possible to achieve state-of-the-art performance by focusing effort on these choices rather than on the learning system itself.

A major drawback of many feature learning systems is their complexity and expense. In addition, many algorithms require careful selection of multiple hyper-parameters like learning rates, momentum, sparsity penalties, weight decay, and so on that must be chosen through cross-validation, thus increasing running times dramatically. Though it is true that recently introduced algorithms have consistently shown improvements on benchmark datasets like NORB [16] and CIFAR-10 [13], there are several other factors that affect the final performance of a feature learning system. Specifically, there are many “meta-parameters” defining the network architecture, such as the receptive field size and number of hidden nodes (features). In practice, these parameters are often determined by computational constraints.
For instance, we might use the largest number of features possible considering the running time of the algorithm. In this paper, however, we pursue an alternative strategy: we employ very simple learning algorithms and then more carefully choose the network parameters in search of higher performance. If (as is often the case) larger representations perform better, then we can leverage the speed and simplicity of these learning algorithms to use larger representations.

To this end, we will begin in Section 3 by describing a simple feature learning framework that incorporates an unsupervised learning algorithm as a “black box” module within. For this “black box”, we have implemented several off-the-shelf unsupervised learning algorithms: sparse auto-encoders, sparse RBMs, K-means clustering, and Gaussian mixture models. We then analyze the performance impact of several different elements in the feature learning framework, including: (i) whitening, which is a common pre-process in deep learning work, (ii) number of features trained, (iii) step-size (stride) between extracted features, and (iv) receptive field size.

It will turn out that whitening, large numbers of features, and small stride lead to uniformly better performance regardless of the choice of unsupervised learning algorithm. On the one hand, these results are somewhat unsurprising. For instance, it is widely held that highly over-complete feature representations tend to give better performance than smaller-sized representations [32], and similarly with small strides between features [21]. However, the main contribution of our work is demonstrating that these considerations may, in fact, be critical to the success of feature learning algorithms—potentially more important even than the choice of unsupervised learning algorithm. Indeed, it will be shown that when we push these parameters to their limits, we can achieve state-of-the-art performance, outperforming many other more complex algorithms on the same task. Quite surprisingly, our best results are achieved using K-means clustering, an algorithm that has been used extensively in computer vision, but that has not been widely adopted for “deep” feature learning. Specifically, we achieve test accuracies of 79.6% on CIFAR-10 and 97.2% on NORB—better than all previously published results.

We will start by reviewing related work on feature learning, then move on to describe a general feature learning framework that we will use for evaluation in Section 3. We then present experimental analysis and results on CIFAR-10 [13] as well as NORB [16] in Section 4.

2 Related work

Since the introduction of unsupervised pre-training [8], many new schemes for stacking layers of features to build “deep” representations have been proposed. Most have focused on creating new training algorithms to build single-layer models that are composed to build deeper structures. Among the algorithms considered in the literature are sparse-coding [22, 17, 32], RBMs [8, 13], sparse RBMs [18], sparse auto-encoders [7, 25], denoising auto-encoders [30], “factored” [24] and mean-covariance [23] RBMs, as well as many others [19, 33]. Thus, amongst the many components of feature learning architectures, the unsupervised learning module appears to be the most heavily scrutinized.

Some work, however, has considered the impact of other choices in these feature learning systems, especially the choice of network architecture. Jarrett et al. [11], for instance, have considered the impact of changes to the “pooling” strategies frequently employed between layers of features, as well as different forms of normalization and rectification between layers. Similarly, Boureau et al. have considered the impact of coding strategies and different types of pooling, both in practice [3] and in theory [4]. Our work follows in this vein, but considers instead the structure of single-layer networks—before pooling, and orthogonal to the choice of algorithm or coding scheme.

Many common threads from the computer vision literature also relate to our work and to feature learning more broadly. For instance, we will use the K-means clustering algorithm as an alternative unsupervised learning module. K-means has been used less widely in “deep learning” work but has enjoyed wide adoption in computer vision for building codebooks of “visual words” [5, 6, 15, 31], which are used to define higher-level image features. This method has also been applied recursively to build multiple layers of features [1]. The effects of pooling and choice of activation function or coding scheme have similarly been studied for these models [15, 28, 21]. Van Gemert et al., for instance, demonstrate that “soft” activation functions (“kernels”) tend to work better than the hard assignment typically used with visual words models.

This paper will compare results along some of the same axes as these prior works (e.g., we will consider both ‘hard’ and ‘soft’ activation functions), but our conclusions differ somewhat: While we confirm that some feature-learning schemes are better than others, we also show that the differences can often be outweighed by other factors, such as the number of features. Thus, even though more complex learning schemes may improve performance slightly, these advantages can be overcome by fast, simple learning algorithms that are able to handle larger networks.

3 Unsupervised feature learning framework

In this section, we describe a common framework used for feature learning. For concreteness, we will focus on the application of these algorithms to learning features from images, though our approach is applicable to other forms of data as well.

The framework we use involves several stages and is similar to those employed in computer vision [5, 15, 31, 28, 1], as well as other feature learning work [16, 19, 3]. At a high-level, our system performs the following steps to learn a feature representation:

1. Extract random patches from unlabeled training images.
2. Apply a pre-processing stage to the patches.
3. Learn a feature-mapping using an unsupervised learning algorithm.

Given the learned feature mapping and a set of labeled training images we can then perform feature extraction and classification:

1. Extract features from equally spaced sub-patches covering the input image.
2. Pool features together over regions of the input image to reduce the number of feature values.
3. Train a linear classifier to predict the labels given the feature vectors.

We will now describe the components of this pipeline and its parameters in more detail.

3.1 Feature Learning

As mentioned above, the system begins by extracting random sub-patches from unlabeled input images. Each patch has dimension w-by-w and has d channels (for example, if the input image is represented in (R,G,B) colors, then it has d = 3 channels), with w referred to as the “receptive field size”. Each w-by-w patch can be represented as a vector in R^N of pixel intensity values, with N = w · w · d. We then construct a dataset of m randomly sampled patches, X = {x^(1), ..., x^(m)}, where x^(i) ∈ R^N. Given this dataset, we apply the pre-processing and unsupervised learning steps.

3.1.1 Pre-processing

It is common practice to perform several simple normalization steps before attempting to generate features from data. In this work, we assume that every patch x^(i) is normalized by subtracting the mean and dividing by the standard deviation of its elements. For visual data, this corresponds to local brightness and contrast normalization.

After normalizing each input vector, the entire dataset X may optionally be whitened [10]. While this process is commonly used in deep learning work (e.g., [24]), it is less frequently employed in computer vision. We will present experimental results obtained both with and without whitening to determine whether this component is generally necessary.
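For concreteness, the normalization and whitening steps can be written in a few lines of NumPy. The sketch below is an illustration rather than the exact code used in the experiments: it assumes ZCA (“zero-phase”) whitening, which is the variant named later in Section 4.2, and the regularization constants `eps` are arbitrary placeholder values.

```python
import numpy as np

def normalize_patches(X, eps=10.0):
    """Per-patch brightness/contrast normalization: subtract the mean and
    divide by the standard deviation of each patch's elements.
    X has shape (m, N); eps guards against division by zero and its value
    here is an assumption, not taken from the paper."""
    X = X - X.mean(axis=1, keepdims=True)
    return X / np.sqrt(X.var(axis=1, keepdims=True) + eps)

def zca_whiten(X, eps=0.01):
    """ZCA ("zero-phase") whitening of the patch dataset.
    Returns the whitened patches plus the (mean, transform) needed to apply
    the same whitening to new patches at feature-extraction time."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = np.cov(Xc, rowvar=False)                 # N x N covariance
    U, S, _ = np.linalg.svd(cov)                   # eigendecomposition
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T  # ZCA transform
    return Xc @ W, mean, W

# Example: 1000 random 6x6x3 patches flattened to N = 108 dimensions.
X = np.random.rand(1000, 108)
Xw, mean, W = zca_whiten(normalize_patches(X))
```

At feature-extraction time, the same `mean` and `W` returned here would be re-applied to every patch drawn from a new image.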

3.1.2 Unsupervised learning

After pre-processing, an unsupervised learning algorithm is used to discover features from the unlabeled data. For our purposes, we will view an unsupervised learning algorithm as a “black box” that takes the dataset X and outputs a function f : R^N → R^K that maps an input vector x^(i) to a new feature vector of K features, where K is a parameter of the algorithm. We denote the kth feature as f_k. In this work, we will use several different unsupervised learning methods in this role: (i) sparse auto-encoders, (ii) sparse RBMs, (iii) K-means clustering, and (iv) Gaussian mixtures. (These algorithms were chosen since they can scale up straight-forwardly to the problem sizes considered in our experiments.) We briefly summarize how these algorithms are employed in our system; a short code sketch of the resulting feature mappings follows the list.

1. Sparse auto-encoder: We train an auto-encoder with K hidden nodes using back-propagation to minimize squared reconstruction error with an additional penalty term that encourages the units to maintain a low average activation [18, 7]. The algorithm outputs weights W ∈ R^{K×N} and biases b ∈ R^K such that the feature mapping f is defined by

   f(x) = g(Wx + b),   (1)

   where g(z) = 1/(1 + exp(−z)) is the logistic sigmoid function, applied component-wise to the vector z.

   There are several hyper-parameters used by the training algorithm (e.g., weight decay, and target activation). These parameters were chosen using cross-validation for each choice of the receptive field size, w. (Ideally, we would perform this cross-validation for every choice of parameters, but the expense is prohibitive for the number of experiments we perform here. This is a major advantage of the K-means algorithm, which requires no such procedure.)

2. Sparse restricted Boltzmann machine: The restricted Boltzmann machine (RBM) is an undirected graphical model with K binary hidden variables. Sparse RBMs can be trained using the contrastive divergence approximation [9] with the same type of sparsity penalty as the auto-encoders. The training also produces weights W and biases b, and we can use the same feature mapping as the auto-encoder (as in Equation (1))—thus, these algorithms differ primarily in their training method. Also as above, the necessary hyper-parameters are determined by cross-validation for each receptive field size.


3. K-means clustering: We apply K-means clustering to learn K centroids c^(k) from the input data. Given the learned centroids c^(k), we consider two choices for the feature mapping f. The first is the standard 1-of-K, hard-assignment coding scheme:

   f_k(x) = 1 if k = arg min_j ||c^(j) − x||_2^2, and 0 otherwise.   (2)

   This is a (maximally) sparse representation that has been used frequently in computer vision [5]. It has been noted, however, that this may be too terse [28]. Thus our second choice of feature mapping is a non-linear mapping that attempts to be “softer” than the above encoding while also keeping some sparsity:

   f_k(x) = max{0, µ(z) − z_k},   (3)

   where z_k = ||x − c^(k)||_2 and µ(z) is the mean of the elements of z. This activation function outputs 0 for any feature f_k where the distance to the centroid c^(k) is “above average”. In practice, this means that roughly half of the features will be set to 0. This can be thought of as a very simple form of “competition” between features.

   We refer to these in our results as K-means (hard) and K-means (triangle) respectively (both appear in the code sketch after this list).

4. Gaussian mixtures: Gaussian mixture models (GMMs) represent the density of input data as a mixture of K Gaussian distributions and are widely used for clustering. GMMs can be trained using the Expectation-Maximization (EM) algorithm as in [1]. We run a single iteration of K-means to initialize the mixture model. (When K-means is run to convergence, we have found that the mixture model does not learn features substantially different from the K-means result.) The feature mapping f maps each input to the posterior membership probabilities:

   f_k(x) = φ_k N(x; c^(k), Σ_k) / ( Σ_j φ_j N(x; c^(j), Σ_j) ),

   where N(x; c^(k), Σ_k) = (2π)^{−d/2} |Σ_k|^{−1/2} exp(−(1/2)(x − c^(k))^T Σ_k^{−1} (x − c^(k))), Σ_k is a diagonal covariance, and φ_k are the cluster prior probabilities learned by the EM algorithm.
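The feature mappings above (Equation (1), Equations (2) and (3), and the GMM posteriors) all take a pre-processed patch and return a K-dimensional code using only the learned parameters, so each can be summarized in a few lines. The NumPy sketch below is illustrative rather than the implementation used in the experiments: the weights, centroids, and mixture parameters are assumed to come from training routines that are not shown, and the log-space arithmetic in the GMM case is a numerical-stability choice made here, not something specified above.

```python
import numpy as np

def encoder_features(x, W, b):
    """Equation (1): f(x) = g(Wx + b) with the logistic sigmoid g.
    Used for both the sparse auto-encoder and the sparse RBM; W is (K, N),
    b is (K,), and x is a pre-processed patch in R^N."""
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

def kmeans_hard(x, centroids):
    """Equation (2): 1-of-K hard assignment to the nearest centroid."""
    z = np.linalg.norm(centroids - x, axis=1)   # distances to the K centroids
    f = np.zeros(len(centroids))
    f[np.argmin(z)] = 1.0
    return f

def kmeans_triangle(x, centroids):
    """Equation (3): "triangle" activation f_k = max(0, mean(z) - z_k);
    features whose centroid is farther than average are zeroed out."""
    z = np.linalg.norm(centroids - x, axis=1)
    return np.maximum(0.0, z.mean() - z)

def gmm_posteriors(x, means, variances, priors):
    """Posterior membership probabilities under a diagonal-covariance GMM,
    computed in log space for numerical stability (our choice, not the
    paper's)."""
    log_p = (np.log(priors)
             - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
             - 0.5 * np.sum((x - means) ** 2 / variances, axis=1))
    log_p -= log_p.max()          # avoid overflow when exponentiating
    p = np.exp(log_p)
    return p / p.sum()

# Example with arbitrary (untrained) parameters: K = 4 features, N = 108.
rng = np.random.default_rng(0)
K, N = 4, 108
x = rng.normal(size=N)
print(encoder_features(x, rng.normal(size=(K, N)), np.zeros(K)))
print(kmeans_hard(x, rng.normal(size=(K, N))))
print(kmeans_triangle(x, rng.normal(size=(K, N))))
print(gmm_posteriors(x, rng.normal(size=(K, N)), np.ones((K, N)), np.full(K, 1.0 / K)))
```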

3.2 Feature Extraction and Classification

The above steps, for a particular choice of unsupervised learning algorithm, yield a function f that transforms an input patch x ∈ R^N to a new representation y = f(x) ∈ R^K. Using this feature extractor, we now apply it to our (labeled) training images for classification.

3.2.1 Convolutional extraction

Using the learned feature extractor f : R^N → R^K, given any w-by-w image patch, we can now compute a representation y ∈ R^K for that patch. We can thus define a (single layer) representation of the entire image by applying the function f to many sub-patches. Specifically, given an image of n-by-n pixels (with d channels), we define a (n − w + 1)-by-(n − w + 1) representation (with K channels), by computing the representation y for each w-by-w “subpatch” of the input image. More formally, we will let y^(ij) be the K-dimensional representation extracted from location i, j of the input image. For computational efficiency, we may also “step” our w-by-w feature extractor across the image with some step-size (or “stride”) s greater than 1. This is illustrated in Figure 1.
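The convolutional extraction step is a pair of nested loops over patch locations. In the sketch below (again illustrative, not the exact experimental code), `encode` stands for any of the feature mappings f from Section 3.1.2, `mean` and `whiten` are whitening statistics estimated on the training patches, and the per-patch normalization constant is an arbitrary placeholder.

```python
import numpy as np

def extract_features(image, w, s, encode, mean, whiten):
    """Map an n-by-n-by-d image to a grid of K-dimensional feature vectors.
    Each w-by-w sub-patch (stepped by s pixels) is normalized, whitened with
    the training-set transform, and passed through the feature mapping."""
    n = image.shape[0]
    rows = []
    for i in range(0, n - w + 1, s):
        row = []
        for j in range(0, n - w + 1, s):
            patch = image[i:i + w, j:j + w, :].reshape(-1)
            patch = (patch - patch.mean()) / np.sqrt(patch.var() + 10.0)
            patch = (patch - mean) @ whiten
            row.append(encode(patch))
        rows.append(row)
    return np.array(rows)   # shape: ((n-w)//s + 1, (n-w)//s + 1, K)

# Example: a 32x32x3 image, 6-pixel receptive field, stride 1, random "centroids".
rng = np.random.default_rng(0)
C = rng.normal(size=(16, 6 * 6 * 3))

def triangle(x):
    z = np.linalg.norm(C - x, axis=1)
    return np.maximum(0.0, z.mean() - z)

Y = extract_features(rng.random((32, 32, 3)), w=6, s=1, encode=triangle,
                     mean=np.zeros(108), whiten=np.eye(108))
print(Y.shape)   # (27, 27, 16)
```

With a stride s > 1 the loops simply skip positions, trading density of the representation for speed; this is the trade-off examined in Section 4.4.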

3.2.2 Classification

Before classification, it is standard practice to reduce the dimensionality of the image representation by pooling. For a stride of s = 1, our feature mapping produces a (n − w + 1)-by-(n − w + 1)-by-K representation. We can reduce this by summing up over local regions of the y^(ij)’s extracted as above. This procedure is commonly used (in many variations) in computer vision [15] as well as deep feature learning [11].

In our system, we use a very simple form of pooling. Specifically, we split the y^(ij)’s into four equal-sized quadrants, and compute the sum of the y^(ij)’s in each. This yields a reduced (K-dimensional) representation of each quadrant, for a total of 4K features that we use for classification.

Given these pooled (4K-dimensional) feature vectors for each training image and a label, we apply standard linear classification algorithms. In our experiments we use (L2) SVM classification. The regularization parameter is determined by cross-validation.
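A minimal sketch of the pooling and classification stage follows. The quadrant pooling mirrors the description above; the linear SVM is scikit-learn's LinearSVC used as a stand-in (the text does not name a particular solver), and the regularization value shown is a placeholder for the cross-validated choice.

```python
import numpy as np
from sklearn.svm import LinearSVC

def pool_quadrants(Y):
    """Sum the feature grid Y (rows x cols x K) over four equal quadrants,
    yielding a single 4K-dimensional vector per image."""
    r, c = Y.shape[0] // 2, Y.shape[1] // 2
    quads = [Y[:r, :c], Y[:r, c:], Y[r:, :c], Y[r:, c:]]
    return np.concatenate([q.sum(axis=(0, 1)) for q in quads])

# Example: random "extracted features" for 20 images, then a linear SVM.
rng = np.random.default_rng(0)
features = np.stack([pool_quadrants(rng.random((27, 27, 16))) for _ in range(20)])
labels = rng.integers(0, 2, size=20)
clf = LinearSVC(C=1.0).fit(features, labels)   # C chosen by cross-validation in practice
print(clf.score(features, labels))
```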


Figure 1: Illustration showing feature extraction using a w-by-w receptive field and stride s. We first extract w-by-w patches separated by s pixels each, then map them to K-dimensional feature vectors to form a new image representation. These vectors are then pooled over 4 quadrants of the image to form a feature vector for classification. (For clarity we have drawn the leftmost figure with a stride greater than w, but in practice the stride is almost always smaller than w.)

4 Experiments and Analysis

The above framework includes a number of parameters that can be changed: (i) whether to use whitening or not, (ii) the number of features K, (iii) the stride s, and (iv) receptive field size w. In this section, we present our experimental results on the impact of these parameters on performance. First, we will evaluate the effects of these parameters using cross-validation on the CIFAR-10 training set. We will then report the results achieved on both CIFAR-10 and NORB test sets using each unsupervised learning algorithm and the parameter settings that our analysis suggests is best overall (i.e., in our final results, we use the same settings for all algorithms). (To clarify: these are the parameters that achieved the best average cross-validation performance across all models: whitening, 1 pixel stride, 6 pixel receptive field, and 1600 features.)

Our basic testing procedure is as follows. For each unsupervised learning algorithm in Section 3.1.2, we will train a single layer of features using either whitened data or raw data and a choice of the parameters K, s, and w. We then train a linear classifier as described in Section 3.2.2, then test the classifier on a holdout set (for our main analysis) or the test set (for our final results).
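The analysis below is essentially a sweep over this grid of settings. The schematic sketch that follows shows only the shape of that loop: the three callables are placeholders for the feature-learning, extraction-and-pooling, and classifier-training stages sketched in Section 3, not functions from any library, and the grid values are the ones examined in Sections 4.2 through 4.5.

```python
import itertools

def parameter_sweep(learn_features, extract_and_pool, train_and_score):
    """Evaluate every (whitening, K, stride, receptive field) combination by
    learning features, building pooled representations, and scoring a linear
    classifier on a held-out split. Returns the best-scoring setting."""
    grid = itertools.product(
        [True, False],                      # whitening on/off
        [100, 200, 400, 800, 1200, 1600],   # number of features K
        [1, 2, 4, 8],                       # stride s (pixels)
        [6, 8, 12])                         # receptive field size w (pixels)
    scores = {}
    for whiten, K, s, w in grid:
        model = learn_features(K=K, w=w, whiten=whiten)
        pooled_train, pooled_holdout = extract_and_pool(model, stride=s)
        scores[(whiten, K, s, w)] = train_and_score(pooled_train, pooled_holdout)
    return max(scores, key=scores.get)

# Schematic usage with stand-in callables (hypothetical names, shown only
# for the structure of the sweep).
best = parameter_sweep(
    learn_features=lambda K, w, whiten: None,
    extract_and_pool=lambda model, stride: (None, None),
    train_and_score=lambda train, holdout: 0.0)
print(best)
```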

4.1 Visualization

Before we present classification results, we first show visualizations of the learned feature representations. The bases (or centroids) learned by sparse autoencoders, sparse RBMs, K-means, and Gaussian mixture models are shown in Figure 2 for 8 pixel receptive fields. It is well-known that autoencoders and RBMs yield localized filters that resemble Gabor filters, and we can see this in our results both when using whitened data and, to a lesser extent, raw data. However, these visualizations also show that similar results can be achieved using clustering algorithms. In particular, while clustering raw data leads to centroids consistent with those in [6] and [29], we see that clustering whitened data yields sharply localized filters that are very similar to those learned by the other algorithms. Thus, it appears that such features are easy to learn with clustering methods (without any parameter tweaking) as a result of whitening.

Figure 3: Effect of whitening and number of bases (or centroids). (Cross-validation accuracy (%) versus number of features for each algorithm, with raw and whitened inputs.)

4.2 Effect of whitening

We now move on to our characterization of performance on various axes of parameters, starting with the effect of whitening (in our experiments, we use zero-phase whitening [2]), which visibly changes the learned bases (or centroids) as seen in Figure 2. Figure 3 shows the performance for all of our algorithms as a function of the number of features (which we will discuss in the next section) both with and without whitening. These experiments used a stride of 1 pixel and 6 pixel receptive field.

For sparse autoencoders and RBMs, the effect of whitening is somewhat ambiguous. When using only 100 features, there is a significant benefit of whitening for sparse RBMs, but this advantage disappears with larger numbers of features. For the clustering algorithms, however, we see that whitening is a crucial pre-process since the clustering algorithms cannot handle the correlations in the data (our GMM implementation uses diagonal covariances, and K-means uses Euclidean distance).


(a) K-means (with and without whitening) (b) GMM (with and without whitening)

(c) Sparse Autoencoder (with and without whitening) (d) Sparse RBM (with and without whitening) Figure 2: Randomly selected bases (or centroids) trained on CIFAR-10 images using different learning algorithms. Best viewed in color.

Figure 4: Effect of stride. (Cross-validation accuracy (%) versus stride between extracted features, in pixels, for kmeans (tri), kmeans (hard), autoencoder, and rbm.)

Figure 5: Effect of receptive field size. (Cross-validation accuracy (%) versus receptive field size, in pixels, for the same algorithms.)

Clustering algorithms have been applied successfully to raw pixel inputs in the past [6, 29] but these applications did not use whitened input data. Our results suggest that improved performance might be obtained by incorporating whitening.

4.3 Number of features

Our experiments considered feature representations with 100, 200, 400, 800, 1200, and 1600 learned features. (We found that training Gaussian mixture models with more than 800 components was often difficult and always extremely slow. Thus we only ran this algorithm with up to 800 components.) Figure 3 clearly shows the effect of increasing the number of learned features: all algorithms generally achieved higher performance by learning more features as expected.

Surprisingly, K-means clustering coupled with the “triangle” activation function and whitening achieves the highest performance. This is particularly notable since K-means requires no tuning whatsoever, unlike the sparse auto-encoder and sparse RBMs which require us to choose several hyper-parameters for best results.

4.4 Effect of stride

The “stride” s used in our framework is the spacing between patches where feature values will be extracted (see Figure 1). Frequently, learning systems will use a stride s > 1 because computing the feature mapping is very expensive.

For instance, sparse coding requires us to solve an optimization problem for each such patch, which may be prohibitive for a stride of 1. It is reasonable to ask, then, how much this compromise costs in terms of performance for the algorithms we consider (which all have the property that their feature mapping can be computed extremely quickly). In this experiment, we fixed the number of features (1600) and receptive field size (6 pixels), and varied the stride over 1, 2, 4, and 8 pixels. The results are shown in Figure 4. (We do not report results with GMMs, since training models of this size was impractical.) The plot shows a clear downward trend in performance with increasing step size as expected. However, the magnitude of the change is striking: for even a stride of s = 2, we suffer a loss of 3% or more accuracy, and for s = 4 we lose at least 5%. These differences can be significant in comparison to the choice of algorithm. For instance, a sparse RBM with stride of 2 performed comparably to the simple hard-assignment K-means scheme using a stride of 1—one of the simplest possible algorithms we could have chosen for unsupervised learning (and certainly much simpler than a sparse RBM).

4.5 Effect of receptive field size

Finally, we also evaluated the effect of receptive field size. Given a scalable algorithm, it's possible that leveraging it to learn larger receptive fields could allow us to recognize more complex features that cover a larger region of the image. On the other hand, this increases the dimensionality of the space that the algorithm must cover and may require us to learn more features or use more data. As a result, given the same amount of data and using the same number of features, it is not clear whether this is a worthwhile investment.

In this experiment, we tested receptive field sizes of 6, 8, and 12 pixels. For other parameters, we used whitening, stride of 1 pixel, and 1600 bases (or centroids). The summary results are shown in Figure 5. Overall, the 6 pixel receptive field worked best. Meanwhile, 12 pixel receptive fields were similar or worse than 6 or 8 pixels. Thus, if we have computational resource to spare, our results suggest that it is better to spend it on reducing stride and expanding the number of learned features.

Unfortunately, unlike for the other parameters, the receptive field size does appear to require cross-validation in order to make an informed choice. Our experiments do suggest, though, that even very small receptive fields can work well (with pooling) and are worth considering. This is especially important if reducing the input size allows us to use a smaller stride or more features, which both have a large positive impact on results.

Table 1: Test recognition accuracy on CIFAR-10

  Algorithm                               Accuracy
  Raw pixels (reported in [13])           37.3%
  3-Way Factored RBM (3 layers) [24]      65.3%
  Mean-covariance RBM (3 layers) [23]     71.0%
  Improved Local Coord. Coding [33]       74.5%
  Conv. Deep Belief Net (2 layers) [14]   78.9%
  Sparse auto-encoder                     73.4%
  Sparse RBM                              72.4%
  K-means (Hard)                          68.6%
  K-means (Triangle)                      77.9%
  K-means (Triangle, 4000 features)       79.6%

Table 2: Test recognition accuracy (and error) for NORB (normalized-uniform)

  Algorithm                               Accuracy (error)
  Conv. Neural Network [16]               93.4% (6.6%)
  Deep Boltzmann Machine [26]             92.8% (7.2%)
  Deep Belief Network [20]                95.0% (5.0%)
  (Best result of [11])                   94.4% (5.6%)
  Deep neural network [27]                97.13% (2.87%)
  Sparse auto-encoder                     96.9% (3.1%)
  Sparse RBM                              96.2% (3.8%)
  K-means (Hard)                          96.9% (3.1%)
  K-means (Triangle)                      97.0% (3.0%)
  K-means (Triangle, 4000 features)       97.21% (2.79%)

4.6 Final classification results

We have shown that whitening, a stride of 1 pixel, a 6 pixel receptive field, and a large number of features works best on average across all algorithms for CIFAR-10. Using these parameters we ran our full pipeline on the entire CIFAR-10 training set, trained a SVM classifier and tested on the standard CIFAR-10 test set. Our final test results on the CIFAR-10 data set with these settings are reported in Table 1 along with results from other publications. Quite surprisingly, the K-means (triangle) algorithm attains very high performance (77.9%) with 1600 features. Based on this success, we sought to improve the results further simply by increasing the number of features to 4000. Using these features, our test accuracy increased to 79.6%.

Based on our analysis here, we have also run each of these algorithms on the NORB “normalized uniform” dataset. We use all of the same parameters as for CIFAR-10, including the 6 pixel receptive field size. The results are summarized in Table 2. Here, all of the algorithms achieve very high performance. Again, K-means with the “triangle” activation achieves the highest performance. When using 4000 features as for CIFAR, we achieve 97.21% accuracy.


We note, however, that the other results are very similar regardless of the algorithm used. This suggests that the main source of performance here is from our choice of network structure, not from the particular choice of unsupervised learning algorithm.

Table 3: Test recognition accuracy on STL-10

  Algorithm                               Accuracy
  Raw pixels                              31.8% (±0.62%)
  K-means (Triangle, 1600 features)       51.5% (±1.73%)

Finally, we also ran our system on the new STL-10 dataset (http://cs.stanford.edu/~acoates/stl10). This dataset uses higher resolution (96x96) images, but allows many fewer training examples (100 per class), while providing a large unlabeled training set—thus forcing algorithms to rely heavily on acquired prior knowledge of image statistics. We applied the same system as used for CIFAR on downsampled versions of the STL images (32x32 pixels). In this case, the performance is much lower than on the comparable CIFAR dataset on account of the small labeled datasets: 51.5% (±1.73%) (compared to 31.8% (±0.62%) for raw pixels). This suggests that the method proposed here is strongest when we have large labeled training sets as with NORB and CIFAR.

5 Discussion

Our results above may seem inexplicable considering the simplicity of the system—it is not clear, on first inspection, exactly what in our experiments allows us to achieve such high performance compared to prior work. We believe that the main explanation for the performance gain is, in fact, our choice of network parameters, since almost all of the algorithms performed favorably relative to previous results.

Each of the network parameters (feature count, stride and receptive field size) we've tested potentially confers a significant benefit on performance. For instance, large numbers of features (regardless of how they're trained) give us many non-linear projections of the data. Unlike simple linear projections, which have limited representational power, it is well-known that using extremely large numbers of non-linear projections can make data closer to linearly separable and thus easier to classify. Hence, larger numbers of features may be uniformly beneficial, regardless of the training algorithm.

The dramatic impact of changes to the stride parameter may be partly explained by the work of Boureau [4]. By setting the stride small, a larger number of samples are incorporated into each pooling area, which was shown both theoretically and empirically to improve results. It is also likely that high-frequency features (edges) are more accurately identified using a dense sampling.

Finally, the receptive field size, which we chose by cross-validation, appears to be important as well. It appears that large receptive fields result in a space that is simply too large to cover effectively with a small number of nonlinear features. For instance, because our features often include shifted copies of edges, increasing the receptive field size also increases the amount of redundancy we can expect in our filters. This caveat might be ameliorated by training convolutionally [19, 16, 12]. Note that small receptive fields might also increase the number of samples used in pooling and thus have a small effect similar to using a smaller stride.

6 Conclusion

In this paper we have conducted extensive experiments on the CIFAR-10 dataset using multiple unsupervised feature learning algorithms to characterize the effect of various parameters on classification performance. While confirming the basic finding that more features and dense extraction are useful, we have shown more importantly that these elements can, in fact, be as important as the unsupervised learning algorithm itself. Surprisingly, we have shown that even the K-means clustering algorithm—an extremely simple learning algorithm with no parameters to tune—is able to achieve state-of-the-art performance on both CIFAR-10 and NORB datasets when used with the network parameters that we have identified in this work. We've also shown more generally that smaller stride and larger numbers of features yield monotonically improving performance, which suggests that while more complex algorithms may have greater representational power, simple but fast algorithms can be highly competitive.

References

[1] A. Agarwal and B. Triggs. Hyperfeatures: multilevel local coding for visual recognition. In European Conference on Computer Vision, 2006.
[2] A. Bell and T. J. Sejnowski. The 'independent components' of natural scenes are edge filters. Vision Research, 37, 1997.
[3] Y. Boureau, F. Bach, Y. LeCun, and J. Ponce. Learning mid-level features for recognition. In CVPR, 2010.
[4] Y. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in visual recognition. In International Conference on Machine Learning, 2010.
[5] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In ECCV Workshop on Statistical Learning in Computer Vision, 2004.


[6] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In Computer Vision and Pattern Recognition, 2005.
[7] I. Goodfellow, Q. Le, A. Saxe, H. Lee, and A. Ng. Measuring invariances in deep networks. In NIPS, 2009.
[8] G. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[9] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.
[10] A. Hyvarinen and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411–430, 2000.
[11] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In International Conference on Computer Vision, 2009.
[12] K. Kavukcuoglu, P. Sermanet, Y.-L. Boureau, K. Gregor, M. Mathieu, and Y. LeCun. Learning convolutional feature hierarchies for visual recognition. In Advances in Neural Information Processing Systems, 2010.
[13] A. Krizhevsky. Learning multiple layers of features from Tiny Images. Master's thesis, Dept. of Comp. Sci., University of Toronto, 2009.
[14] A. Krizhevsky. Convolutional Deep Belief Networks on CIFAR-10. Unpublished manuscript, 2010.
[15] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition, 2006.
[16] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR, 2004.
[17] H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In NIPS, 2007.
[18] H. Lee, C. Ekanadham, and A. Y. Ng. Sparse deep belief net model for visual area V2. In NIPS, 2008.
[19] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.
[20] V. Nair and G. E. Hinton. 3D object recognition with deep belief nets. In NIPS, 2009.
[21] E. Nowak, F. Jurie, and B. Triggs. Sampling strategies for bag-of-features image classification. In European Conference on Computer Vision, 2006.
[22] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996.
[23] M. Ranzato and G. E. Hinton. Modeling Pixel Means and Covariances Using Factorized Third-Order Boltzmann Machines. In Computer Vision and Pattern Recognition, 2010.
[24] M. Ranzato, A. Krizhevsky, and G. E. Hinton. Factored 3-way Restricted Boltzmann Machines for Modeling Natural Images. In AISTATS 13, 2010.
[25] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In NIPS, 2007.
[26] R. Salakhutdinov and G. E. Hinton. Deep Boltzmann Machines. In AISTATS 12, 2009.
[27] R. Uetz and S. Behnke. Large-scale object recognition with CUDA-accelerated hierarchical neural networks. In Intelligent Computing and Intelligent Systems, 2009.
[28] J. C. van Gemert, J. M. Geusebroek, C. J. Veenman, and A. W. M. Smeulders. Kernel codebooks for scene categorization. In European Conference on Computer Vision, 2008.
[29] M. Varma and A. Zisserman. A statistical approach to material classification using image patch exemplars. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006.
[30] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
[31] J. Winn, A. Criminisi, and T. Minka. Object categorization by learned universal visual dictionary. In International Conference on Computer Vision, 2005.
[32] J. Yang, K. Yu, Y. Gong, and T. S. Huang. Linear spatial pyramid matching using sparse coding for image classification. In Computer Vision and Pattern Recognition, 2009.
[33] K. Yu and T. Zhang. Improved local coordinate coding using local tangents. In International Conference on Machine Learning, 2010.
