
On autoencoder scoring

Hanna Kamyshanska [email protected]
Goethe Universität Frankfurt, Robert-Mayer-Str. 11-15, 60325 Frankfurt, Germany

Roland Memisevic [email protected]
University of Montreal, CP 6128, succ. Centre-Ville, Montreal H3C 3J7, Canada

Abstract

Autoencoders are popular feature learning models because they are conceptually simple, easy to train and allow for efficient inference and training. Recent work has shown how certain autoencoders can assign an unnormalized "score" to data which measures how well the autoencoder can represent the data. Scores are commonly computed by using training criteria that relate the autoencoder to a probabilistic model, such as the Restricted Boltzmann Machine. In this paper we show how an autoencoder can assign meaningful scores to data independently of training procedure and without reference to any probabilistic model, by interpreting it as a dynamical system. We discuss how, and under which conditions, running the dynamical system can be viewed as performing gradient descent in an energy function, which in turn allows us to derive a score via integration. We also show how one can combine multiple, unnormalized scores into a generative classifier.

1. Introduction

Unsupervised learning has been based traditionally on probabilistic modeling and maximum likelihood estimation. In recent years, a variety of models have been proposed which define learning as the construction of an unnormalized energy surface and inference as finding local minima of that surface. Training such energy-based models amounts to decreasing energy near the observed training data points and increasing it everywhere else (Hinton, 2002; lec). Maximum likelihood learning can be viewed as a special case, where the exponential of the negative energy integrates to 1.

Probably the most successful recent examples of non-probabilistic unsupervised learning are autoencoder networks, which were shown to yield state-of-the-art performance in a wide variety of tasks, ranging from object recognition and learning invariant representations to syntactic modeling of text (Le et al., 2012; Socher et al., 2011; Rolfe & LeCun, 2013; Swersky et al., 2011; Vincent, 2011; Memisevic, 2011; Zou et al., 2012). Learning amounts to minimizing reconstruction error using back-prop. Typically, one regularizes the autoencoder, for example, by adding noise to the input data or by adding a penalty term to the training objective (Vincent et al., 2008; Rifai et al., 2011).

The most common operation after training is to infer the latent representation from data, which can be used, for example, for classification. For the autoencoder, extracting the latent representation amounts to evaluating a feed-forward mapping. Since training is entirely unsupervised, one autoencoder is typically trained for all classes, and it is the job of the classifier to find and utilize class-specific differences in the representation. This is in contrast to the way in which probabilistic models have been used predominantly in the past: Although probabilistic models (such as Gaussian mixtures) may be used to extract class-independent features for classification, it has been more common to train one model per class and to subsequently use Bayes' rule for classification (Duda & Hart, 2001). In the case where an energy-based model can assign confidence scores to data, such class-specific unsupervised learning is possible without Bayes' rule: When scores do not integrate to 1 one can, for example, train a classifier on the vector of scores across classes (Hinton, 2002), or back-propagate label information to the class-specific models using a "gated softmax" response (Memisevic et al., 2011).

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

1.1. Autoencoder scoring

It is not immediately obvious how one may compute scores from an autoencoder, because the energy landscape does not come in an explicit form. This is in contrast to undirected probability models like the Restricted Boltzmann Machine (RBM) or Markov Random Fields, which define the score (or negative energy) as an unnormalized log-probability. Recent work has shown that autoencoders can assign scores, too, if they are trained in a certain way: If noise is added to the input data during training, minimizing squared error is related to performing score matching (Hyvärinen, 2005) in an undirected probabilistic model, as shown by (Swersky et al., 2011) and (Vincent, 2011). This, in turn, makes it possible to use the RBM energy as a score. A similar argument can be made for other, related training criteria. For example, (Rifai et al., 2011) suggest training autoencoders using a "contraction penalty" that encourages latent representations to be locally flat, and (Alain & Bengio, 2013) show that such a regularization penalty allows us to interpret the autoencoder reconstruction function as an estimate of the gradient of the data log probability.¹

All these approaches to defining confidence scores rely on a regularized training criterion (such as denoising or contraction), and scores are computed by using the relationship with a probabilistic model. As a result, scores can be computed easily only for autoencoders that have sigmoid hidden unit activations and linear outputs, and that are trained by minimizing squared error (Alain & Bengio, 2013). The restriction of activation function is at odds with the growing interest in unconventional activation functions, like quadratic or rectified linear units, which seem to work better in supervised recognition tasks (eg., (Krizhevsky et al., 2012)).

In this work, we show how autoencoder confidence scores may be derived by interpreting the autoencoder as a dynamical system. The view of the autoencoder as a dynamical system was proposed by (Seung, 1998), who also demonstrated how de-noising as a learning criterion follows naturally from this perspective. To compute scores, we will assume "tied weights", that is, decoder and encoder weights are transposes of each other. In contrast to probabilistic arguments based on score matching and regularization (Swersky et al., 2011; Vincent, 2011; Alain & Bengio, 2013), the dynamical systems perspective allows us to assign confidence scores to networks with sigmoid output units (binary data) and arbitrary hidden unit activations (as long as these are integrable). In contrast to (Rolfe & LeCun, 2013), we do not address the role of dynamics in learning. In fact, we show how one may derive confidence scores that are entirely agnostic to the learning procedure used to train the model.

As an application of autoencoder confidence scores we describe a generative classifier based on class-specific autoencoders. The model achieves 1.27% error rate on permutation invariant MNIST, and yields competitive performance on the deep learning benchmark dataset (Larochelle et al., 2007).

¹ The term "score" is also frequently used to refer to the gradient of the data log probability. In this paper we use the term to denote a confidence value that the autoencoder assigns to data.

2. Autoencoder confidence scores

Autoencoders are feed-forward neural networks used to learn representations of data. They map input data to a hidden representation using an encoder function

    h(Wx + b_h)    (1)

from which the data is reconstructed using a linear decoder

    r(x) = A h(Wx + b_h) + b_r    (2)

We shall assume that A = W^T in the following ("tied weights"). This is common in practice, because it reduces the number of parameters and because related probabilistic models, like the RBM, are based on tied weights, too.

For training, one typically minimizes squared reconstruction error (r(x) − x)² for a set of training cases. When the number of hidden units is small, autoencoders learn to perform dimensionality reduction. In practice, it is more common to learn sparse representations by using a large number of hidden units and training with a regularizer (eg., (Rifai et al., 2011; Vincent et al., 2008)). A wide variety of models can be learned that way, depending on the activation function, number of hidden units and nature of the regularization during training. The function h(·) can be the identity or it can be an element-wise non-linearity, such as a sigmoid. Autoencoders defined using Eq. 2 with tied weights and logistic sigmoid non-linearity h(a) = (1 + exp(−a))^{-1} are closely related to RBMs (Swersky et al., 2011; Vincent, 2011; Alain & Bengio, 2013), for which one can assign confidence scores to data in the form of unnormalized probabilities.

For binary data, the decoder typically gets replaced by

    r(x) = σ(A h(Wx + b_h) + b_r)    (3)

and training is done by minimizing cross-entropy loss.
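For concreteness, the following is a minimal NumPy sketch of the tied-weight autoencoder defined by Eqs. 1-3, with a sigmoid hidden non-linearity; the class and variable names are ours and only illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class TiedAutoencoder:
    """Tied-weight autoencoder: r(x) = W^T h(W x + b_h) + b_r (Eqs. 1-2)."""
    def __init__(self, num_visible, num_hidden, rng=None):
        rng = rng or np.random.RandomState(0)
        self.W = 0.01 * rng.randn(num_hidden, num_visible)   # encoder weights; decoder uses W^T
        self.b_h = np.zeros(num_hidden)                      # hidden biases
        self.b_r = np.zeros(num_visible)                     # reconstruction biases

    def hidden(self, x):
        return sigmoid(self.W.dot(x) + self.b_h)             # Eq. 1 with h = sigmoid

    def reconstruct(self, x, binary=False):
        pre = self.W.T.dot(self.hidden(x)) + self.b_r        # Eq. 2 (linear decoder)
        return sigmoid(pre) if binary else pre               # Eq. 3 for binary data
```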

Even though confidence scores (negative free energies) are well-defined for binary output RBMs, there has been no analogous score function for the autoencoder, because the relationship with score matching breaks down in the binary case (Alain & Bengio, 2013). As we shall show, the perspective of dynamical systems allows us to attribute the missing link to the lack of symmetry. We also show how we can regain symmetry and thereby obtain a confidence score for binary output autoencoders by applying a log-odds transformation on the outputs of the autoencoder.

2.1. Reconstruction as energy minimization

Autoencoders may be viewed as dynamical systems, by noting that the function r(x) − x (using the definition in Eq. 2) is a vector field which represents the linear transformation that x undergoes as a result of applying the reconstruction function r(x) (Seung, 1998; Alain & Bengio, 2013). Repeatedly applying the reconstruction function (possibly with a small inference rate ε) to an initial x will trace out a non-linear trajectory x(t) in the data-space.

If the number of hidden units is smaller than the number of data dimensions, then the set of fixed points of the dynamical system will be approximately a low-dimensional manifold in the data-space (Seung, 1998). For overcomplete hiddens it can be a more complex structure.

(Alain & Bengio, 2013), for example, show that for denoising and contractive autoencoders, the reconstruction is proportional to the derivative of the log probability of x:

    r(x) - x = \lambda \frac{\partial \log P(x)}{\partial x} + O(\lambda)    (4)

Running the autoencoder by following a trajectory as prescribed by the vector field may also be viewed in analogy to running a Gibbs sampler in an RBM, where the fixed points play the role of a maximum probability "ridge" and where the samples are deterministic, not stochastic.

Some vector fields can be written as the derivative of a scalar field: In such a case, running the dynamical system can be thought of as performing gradient descent in the scalar field. We may call this scalar field energy E(x) and interpret the vector field as a corresponding "force" in analogy to physics, where the potential force acting on a point is the gradient of the potential energy at that point. The autoencoder reconstruction may thus be viewed as pushing data samples in the direction of lower energy (Alain & Bengio, 2013).

The reason why evaluating the potential energy for the autoencoder would be useful is that it allows us to assess how much the autoencoder "likes" a given input x (up to a normalizing constant which is the same for any two inputs). That way, the potential energy plays an analogous role to the free energy in an RBM (Hinton, 2002; Swersky et al., 2011). As shown by (Alain & Bengio, 2013), the view of the autoencoder as modeling an energy surface implies that reconstruction error is not a good measure of confidence, because the reconstruction error will be low at both local minima and local maxima of the energy.

A simple condition for a vector field to be a gradient field is given by Poincaré's integrability criterion: For some open, simply connected set U, a continuously differentiable function F: U → R^n defines a gradient field if and only if

    \frac{\partial F_j(x)}{\partial x_i} = \frac{\partial F_i(x)}{\partial x_j}, \quad \forall i, j = 1..n    (5)

In other words, integrability follows from symmetry of the partial derivatives.

Consider an autoencoder with shared weight matrix W and biases b_h and b_r, which has some activation function h(·) (e.g. sigmoid, hyperbolic tangent, linear). We have:

    \frac{\partial (r_m(x) - x_m)}{\partial x_n} = \sum_j W_{mj} \frac{\partial h(Wx + b_h)_j}{\partial (Wx + b_h)_j} W_{nj} - \delta_{mn} = \frac{\partial (r_n(x) - x_n)}{\partial x_m}    (6)

so the integrability criterion is satisfied.

2.2. Computing the energy surface

One way to find the potential energy whose derivative is r(x) − x is to integrate the vector field (compute its antiderivative²). This turns out to be fairly straightforward for autoencoders with a single hidden layer, linear output activations, and symmetric weights, in other words

    r(x) = W^T h(Wx + b_h) + b_r    (7)

where h(·) is an elementwise activation function, such as the sigmoid.

² After computing the energy function, it is easy to check, in hindsight, whether the vector field defined by the autoencoder is really the gradient field of that energy function: Compute the derivative of the energy, and check if it is equal to r(x) − x. For example, to check the correctness of Eqs. 11 and 13, differentiate the equations wrt. x.
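Footnote 2 suggests an easy numerical sanity check: with tied weights, the Jacobian of the vector field r(x) − x should be symmetric (Eqs. 5-6). The sketch below, which assumes the TiedAutoencoder class from the previous listing, verifies this with finite differences; the tolerance and sizes are arbitrary.

```python
import numpy as np

def vector_field_jacobian(ae, x, eps=1e-5):
    """Finite-difference Jacobian of the vector field F(x) = r(x) - x."""
    n = x.shape[0]
    J = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        J[:, i] = ((ae.reconstruct(x + e) - (x + e)) -
                   (ae.reconstruct(x - e) - (x - e))) / (2 * eps)
    return J

# With tied weights the asymmetry should vanish up to numerical error:
# ae = TiedAutoencoder(num_visible=20, num_hidden=10)
# x = np.random.RandomState(1).rand(20)
# J = vector_field_jacobian(ae, x)
# print(np.abs(J - J.T).max())   # close to zero, so Eq. 5 holds
```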

Figure 1. Some hidden unit activation functions and their antiderivatives.
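The activation/antiderivative pairs illustrated in Fig. 1 and used in the examples below can be written out directly. This is a sketch (the names are ours); each second element H satisfies H'(u) = h(u), which is all the energy recipe below relies on.

```python
import numpy as np

# (activation h(u), antiderivative H(u)) pairs used by the energy recipe below
ACTIVATIONS = {
    "linear":    (lambda u: u,                      lambda u: 0.5 * u ** 2),
    "sigmoid":   (lambda u: 1 / (1 + np.exp(-u)),   lambda u: np.logaddexp(0.0, u)),  # log(1 + e^u)
    "tanh":      (np.tanh,                          lambda u: np.log(np.cosh(u))),    # naive; may overflow for large |u|
    "rectifier": (lambda u: np.maximum(0.0, u),     lambda u: 0.5 * np.maximum(0.0, u) ** 2),
    "modulus":   (np.abs,                           lambda u: 0.5 * u * np.abs(u)),
}
```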

We can now write

    F(x) = \int (r(x) - x) \, dx
         = \int \left( W^T h(Wx + b_h) + b_r - x \right) dx    (8)
         = \int W^T h(Wx + b_h) \, dx + \int (b_r - x) \, dx

By defining the auxiliary variables u = Wx + b_h and using

    \frac{du}{dx} = W^T \quad \Leftrightarrow \quad dx = W^{-T} du    (9)

we get

    F(x) = \int W^T W^{-T} h(u) \, du + b_r^T x - \frac{1}{2} \|x\|^2 + \text{const}
         = \int h(u) \, du + b_r^T x - \frac{1}{2} \|x\|^2 + \text{const}
         = \int h(u) \, du - \frac{1}{2} \|x - b_r\|^2 + \frac{1}{2} b_r^T b_r + \text{const}
         = \int h(u) \, du - \frac{1}{2} \|x - b_r\|^2 + \text{const}    (10)

where the last equation uses the fact that b_r does not depend on x.

If h(u) is an elementwise activation function, then the final integral is simply the sum over the antiderivatives of the hidden unit activation functions applied to x. In other words, we can compute the integral using the following recipe:

1. compute the net inputs to the hidden units: Wx + b_h
2. compute hidden unit activations using the antiderivative of h(u) as the activation function
3. sum up the activations and subtract \frac{1}{2} \|x - b_r\|^2

Example: sigmoid hiddens. In the case of sigmoid activation functions h(u) = (1 + exp(−u))^{-1}, we get

    F_{\text{sigmoid}}(x) = \int (1 + \exp(-u))^{-1} \, du - \frac{1}{2} \|x - b_r\|^2 + \text{const}
                          = \sum_k \log\left(1 + \exp(W_{k\cdot} x + b_{h_k})\right) - \frac{1}{2} \|x - b_r\|^2 + \text{const}    (11)

which is identical to the free energy in a binary-Gaussian RBM (eg., (Welling et al., 2005)). It is interesting to note that in the case of contractive or denoising training (Alain & Bengio, 2013), the energy can in fact be shown to approximate the log-probability of the data (cf., Eq. 4):

    F(x) := \int (r(x) - x) \, dx \propto \log P(x)    (12)

But Eq. 11 is more general, as it holds independently of the training procedure.

Example: linear hiddens. The antiderivative of the linear activation, h(u) = u, is u²/2, so for PCA and a linear autoencoder, it is simply the norm of the latent representation. More precisely, we have

    F_{\text{linear}}(x) = \int u \, du - \frac{1}{2} \|x - b_r\|^2 + \text{const}
                         = \frac{1}{2} u^T u - \frac{1}{2} \|x - b_r\|^2 + \text{const}
                         = \frac{1}{2} (Wx + b_h)^T (Wx + b_h) - \frac{1}{2} \|x - b_r\|^2 + \text{const}    (13)

It is interesting to note that, if we disregard biases and assume W W^T = I (the PCA solution), then F_{\text{linear}}(x) turns into the negative squared reconstruction error. This is how one would typically assign confidences to PCA models, for example, in a PCA-based classifier.
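A direct transcription of the three-step recipe (Eq. 10) might look as follows, reusing the TiedAutoencoder and ACTIVATIONS sketches above; the unknown additive constant is simply dropped.

```python
import numpy as np

def energy(ae, x, antiderivative):
    """F(x) = sum_k H((Wx + b_h)_k) - 0.5 * ||x - b_r||^2, up to a constant (Eq. 10)."""
    u = ae.W.dot(x) + ae.b_h                                 # step 1: net inputs to the hiddens
    acts = antiderivative(u)                                 # step 2: antiderivative of h as "activation"
    return np.sum(acts) - 0.5 * np.sum((x - ae.b_r) ** 2)    # step 3: sum up and subtract

# Sigmoid hiddens recover Eq. 11 (the binary-Gaussian RBM free energy),
# linear hiddens recover Eq. 13:
# F_sig = energy(ae, x, ACTIVATIONS["sigmoid"][1])
# F_lin = energy(ae, x, ACTIVATIONS["linear"][1])
```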

It is straightforward to calculate the energies for other hidden unit activations, including those for which the energy cannot be normalized, in which case there is no corresponding RBM. Two commonly deployed activation functions are, for example: the rectifier, whose antiderivative is the "half-square" (sign(x) + 1) x²/4, and the softplus activation, whose antiderivative is the so-called polylogarithm. A variety of activation functions and their antiderivatives are shown in Fig. 1.

2.3. Binary data

When dealing with binary data it is common to use sigmoid activations on the outputs:

    r(x) = \sigma\left( W^T h(Wx + b_h) + b_r \right)    (14)

and to train the model using cross-entropy (but like in the case of real outputs, the criterion used for training will not be relevant to compute scores). In the case of sigmoid output activations, the integrability criterion (Eq. 6) does not hold, because of the lack of symmetry of the derivatives. However, we can obtain confidence scores by monotonically transforming the data as follows: We apply the inverse of the logistic sigmoid (the "log-odds" transformation)

    \xi(x) = \log \frac{x}{1 - x}    (15)

in the input domain. Now we can define the new vector field

    v(x) = \xi(r(x)) - \xi(x)
         = \xi\left( \sigma\left( W^T h(Wx + b_h) + b_r \right) \right) - \log \frac{x}{1 - x}
         = W^T h(Wx + b_h) + b_r - \log \frac{x}{1 - x}    (16)

The vector field v(x) has the same fixed points as r(x) − x, because invertibility of ξ(x) implies

    r(x) = x \quad \Leftrightarrow \quad \xi(r(x)) = \xi(x)    (17)

So the original and transformed autoencoder converge to the same representations of x. By integrating v(x), we get

    F(x) = \int h(u) \, du + b_r^T x - \left( \log(1 - x) + x \log \frac{x}{1 - x} \right) + \text{const}    (18)

Due to the convention 0 log 0 = 0, the term −(log(1 − x) + x log(x/(1 − x))) vanishes for binary data (Cover & Thomas, 1991). In that case, the energy takes exactly the same form as the free energy of a binary output RBM with binary hidden units (Hinton, 2002). However, hidden unit activation functions can be chosen arbitrarily and enter the score using the recipe described above. Also, training data may not always be binary. When it takes on values between 0 and 1, the log-terms do not equal 0 and have to be included in the energy computation.

3. Combining autoencoders for classification

Being able to assign unnormalized scores to data can be useful in a variety of tasks, including visualization or classification based on ranking of data examples. Like for the RBM, the lack of normalization causes scores to be relative, not absolute. This means that we can compare the scores that an autoencoder assigns to multiple data-points, but we cannot compare the scores that multiple autoencoders assign to the same data-point. We shall now discuss how we can turn these scores into a classification decision.

Fig. 2 shows exemplary energies that various types of contractive autoencoder (cAE, (Rifai et al., 2011)), trained on MNISTsmall digits 6 and 9, assign to test cases from those classes. It shows that all models yield fairly well separated confidence scores, when the appropriate anti-derivatives are used. Squared error on the sigmoid networks separates these examples fairly well too (rightmost plot). However, in a multi-class task, reconstruction error typically does not work well (Susskind et al., 2011), and it is not a good confidence measure, as we discussed in Section 2. As we shall show, one can achieve competitive classification performance on MNIST and other data-sets by using properly "calibrated" energies, however.

3.1. Supervised finetuning

Since the energy scores are unnormalized, we cannot use Bayes' rule for classification, unlike with directed graphical models. Here, we adopt the approach proposed by (Memisevic et al., 2011) for Restricted Boltzmann Machines and adapt it to autoencoders.

In particular, denoting the energy scores that the autoencoders for different classes assign to data as E_i(x), i = 1, ..., K, we can define the conditional distribution over classes y_i as the softmax response

    P(y_i \mid x) = \frac{\exp(E_i(x) + C_i)}{\sum_j \exp(E_j(x) + C_j)}    (19)

where C_i is the bias term for class y_i. Each C_i may be viewed also as the normalizing constant of the i-th autoencoder, which, since it cannot be determined from the input data, needs to be trained from the labeled data.
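A sketch of the binary-output score (Eq. 18) and of the calibrated softmax (Eq. 19) could look as follows, again reusing the earlier sketches; the clipping constant is ours, added only to make the 0 log 0 = 0 convention explicit.

```python
import numpy as np

def binary_energy(ae, x, antiderivative, eps=1e-12):
    """Score of a sigmoid-output autoencoder (Eq. 18), up to a constant; x in [0, 1]."""
    u = ae.W.dot(x) + ae.b_h
    x_log_x = x * np.log(np.clip(x, eps, 1.0))                 # 0 log 0 := 0
    one_minus = (1 - x) * np.log(np.clip(1 - x, eps, 1.0))
    return np.sum(antiderivative(u)) + ae.b_r.dot(x) - np.sum(x_log_x + one_minus)

def class_posterior(energies, C):
    """P(y_i | x) = softmax_i(E_i(x) + C_i), Eq. 19; energies and C are length-K arrays."""
    a = energies + C
    a = a - a.max()                                            # numerical stability
    e = np.exp(a)
    return e / e.sum()
```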

[Figure 2: five scatter plots of class-specific scores for test digits 6 and 9. Panels: sigmoid activation, tanh activation, linear activation (real input), linear activation (binary input), and sigmoid activation (squared errors). Axes: E6(x) vs. E9(x) in the first four panels, R6(x) vs. R9(x) in the last; points labeled class 6 and class 9.]

Figure 2. Confidence scores assigned by class-specific contractive autoencoders to digits 6 and 9 from the MNISTsmall data set.

Eq. 19 may be viewed as a "contrastive" objective function that compares class y_i against all other classes in order to determine the missing normalizing constants of the individual autoencoders. It is reminiscent of noise-contrastive estimation (Gutmann & Hyvärinen, 2012), with the difference that the contrastive signal is provided by other classes, not noise. Optimizing the log-probabilities (log of Eq. 19) is simply a form of logistic regression. We shall refer to the model as autoencoder scoring (AES) in the following. Like the gated softmax model (Memisevic et al., 2011), we may optimize Eq. 19 wrt. the autoencoder parameters by back-propagating the logistic regression cost to the autoencoder parameters.

3.2. Parameter factorization

We showed in Section 2 that scores can be computed for autoencoders with a single hidden layer only. However, if we train multi-layer autoencoders whose bottom layer weights are tied across models, we may view the bottom layers as a way to perform class-independent pre-processing. In many classification tasks this kind of pre-processing makes sense, because similar features may appear in several classes, so there is no need to learn them separately for each class. Class-specific autoencoders with shared bottom-layer weights can also be interpreted as standard autoencoders with factorized weight matrices (Memisevic et al., 2011).

Fig. 3 shows an illustration of a factored autoencoder. The model parameters are filter matrices W^x and W^h, which are shared among the classes, as well as matrices W^f, B^h and B^r, which consist of stacked class-dependent feature-weights and bias vectors. Using one-hot encoded labels, we can write the hidden unit activations as

    h_k = \sum_f \left( \sum_i W^x_{if} x_i \right) \left( \sum_j t_j W^f_{jf} \right) W^h_{fk} + \sum_j t_j B^h_{jk}    (20)

This model combines m class-dependent AEs with tied weight matrices and biases { W^j, b_h^j, b_r^j | j = 1..m }, where each weight matrix W^j is a product of two class-independent (shared) matrices W^x and W^h and one class-dependent vector W^f_{j\cdot}:

    W^j_{ik} = \sum_f W^x_{if} W^f_{jf} W^h_{fk}    (21)

The first encoder layer (W^x) learns the class-independent features, the second layer (W^f) learns how important these features are for a class and weights them accordingly. Finally, the third layer (W^h) learns how to overlay the weighted features to get the hidden representation. All these layers have linear activations except for the last one. Reconstructions take the form

    r^j_i(x) = \sum_k W^j_{ik} \, \sigma(h^j_k) + b^j_{r_i}    (22)

To learn the model, we pre-train all m autoencoders together on data across all classes, and we use the labels to determine for each observation which intermediate-level weights to train.

Figure 3. Factored autoencoders with weight-sharing. Top: a single factored autoencoder; bottom: encoder part of multiple factored autoencoders that share weights. Dashed lines represent class-specific weights, solid lines represent weights that are shared across classes.
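Below is a sketch of the factored encoder and decoder of Eqs. 20-22 for a one-hot label vector t; the matrix shapes stated in the comments are assumptions we make explicit, not prescribed by the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def factored_hidden(x, t, Wx, Wf, Wh, Bh):
    """Eq. 20 (linear layers): h = ((Wx^T x) * (Wf^T t)) Wh + t Bh.

    Assumed shapes: x (D,), t (m,) one-hot, Wx (D,F), Wf (m,F), Wh (F,K), Bh (m,K)."""
    factors = Wx.T.dot(x) * Wf.T.dot(t)        # (F,): class-weighted factor responses
    return factors.dot(Wh) + t.dot(Bh)         # (K,): hidden representation before the sigmoid

def factored_reconstruct(x, t, Wx, Wf, Wh, Bh, Br):
    """Eqs. 21-22: r^j(x) = W^j sigma(h^j) + b_r^j, with W^j = Wx diag(Wf[j,:]) Wh."""
    h = factored_hidden(x, t, Wx, Wf, Wh, Bh)
    Wj = (Wx * Wf.T.dot(t)).dot(Wh)            # (D,K): class-specific weight matrix, Eq. 21
    return Wj.dot(sigmoid(h)) + t.dot(Br)      # Br assumed (m,D): stacked class-dependent biases
```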

Figure 4. Error rates with and without pre-training, using ordinary and factored class-specific autoencoders.

Figure 5. Example images and filters learned by class-specific autoencoders. Top-to-bottom: RECTimg (class 0 left, class 1 right); MNISTrotImg (factored model); RECTANGLES (factored model).

3.3. Performance evaluation

We tested the model (factored and plain) on the "deep learning benchmark" (Larochelle et al., 2007). Details on the data sets are listed in Table 1.

To learn the initial representations, we trained the class-specific autoencoders with contraction penalty (Rifai et al., 2011), and the factored models as denoising models (Vincent et al., 2008) (contraction penalties are not feasible in multilayer networks (Rifai et al., 2011)). For comparability with (Memisevic et al., 2011), we used logistic sigmoid hiddens as described in Section 3.2 unless otherwise noted. To train class-dependent autoencoders, we used labeled samples (x, t), where labels t are in one-hot encoding.

For pre-training, we fixed both the contraction penalty and corruption level to 0.5. In some cases we normalized filters after each gradient-step during pretraining. The model type (factored vs. non-factored) and parameters for classification (number of hidden units, number of factors, weight decay and learning rate) were chosen based on a validation set. In most cases we tested 100, 300 and 500 hiddens and factors. The models were trained by gradient descent for 100 epochs.

We compared our approach from Section 3 to a variety of baselines and variations:

1. Train an autoencoder for each class. Then compute for each input x a vector of energies (E_1(x), ..., E_m(x)) (setting the unknown integration constants to zero) and train a linear classifier on labeled energy vectors instead of using the original data (Hinton, 2002).

2. Train an autoencoder for each class. Then learn only the normalizing constants by maximizing the conditional log likelihood \sum_i \log P_C(y_i | x_i) on the training samples as a function of C = (C_1, ..., C_m).

3. Optimize Eq. 19 wrt. the autoencoder parameters using back-prop, but without class-dependent pretraining.

4. Assign data samples to the AE with the smallest reconstruction error.

Method 4 seems straightforward, but it did not lead to any reasonable results. Methods 1 and 2 run very fast due to the small amount of trainable parameters, but they do not show good performance either. Method 3 led to consistently better results, but as illustrated in Fig. 4 it performs worse than the procedure from Section 3. Two lessons to learn from this are that (i) generative training of each autoencoder on its own class is crucial to achieve good performance, and (ii) it is not sufficient to tweak normalizing constants, since backpropagating to the autoencoder parameters significantly improves performance.
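To make baseline 2 concrete, here is a small sketch (ours, with placeholder data) that keeps the autoencoders fixed and fits only the per-class constants C_i by gradient ascent on the conditional log-likelihood of Eq. 19.

```python
import numpy as np

def fit_constants(E, y, num_classes, lr=0.1, epochs=100):
    """E: (N, K) per-class energies E_i(x_n); y: (N,) integer labels.

    Maximizes (1/N) sum_n log P(y_n | x_n) from Eq. 19 with respect to C only."""
    C = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]                      # one-hot targets, shape (N, K)
    for _ in range(epochs):
        a = E + C                                   # logits E_i(x) + C_i
        a = a - a.max(axis=1, keepdims=True)        # numerical stability
        P = np.exp(a)
        P = P / P.sum(axis=1, keepdims=True)        # Eq. 19, row-wise
        C = C + lr * (Y - P).mean(axis=0)           # gradient of the mean log-likelihood
    return C
```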

Table 2 shows the classification error rates of the method from Section 3 in comparison to various comparable models from the literature. It shows that class-specific autoencoders can yield highly competitive classification results. For the GSM (Memisevic et al., 2011), we report the best performance of factored vs. non-factored on the test data, which may introduce a bias in favor of that model. In Tab. 3 we furthermore compare the AES performance for different activation functions on MNIST using contractive AEs with 100 hidden units. The corresponding AE energies are shown in Fig. 2. Some example images with corresponding filters learned by the ordinary and factored AES model are displayed in Fig. 5. We used the Theano Python library (Bergstra et al., 2010) for most of our experiments. An implementation of the model is available at: www.iro.umontreal.ca/~memisevr/aescoring

Table 1. Datasets details (Larochelle et al., 2007).

Data set     train   valid.  test    cl.
RECT         1100    100     50000   2
RECTimg      11000   1000    12000   2
CONVEX       7000    1000    50000   2
MNISTsmall   10000   2000    50000   10
MNISTrot     11000   1000    50000   10
MNISTImg     11000   1000    50000   10
MNISTrand    11000   1000    50000   10
MNISTrotIm   11000   1000    50000   10

Table 2. Classification error rates on the deep learning benchmark dataset. SVM and RBM results are taken from (Vincent et al., 2010), deep net and GSM results from (Memisevic et al., 2011).

Data         SVM rbf  RBM    deep SAA3  GSM    AES
RECT         2.15     4.71   2.14       0.56   0.84
RECTimg      24.04    23.69  24.05      22.51  21.45
CONVEX       19.13    19.92  18.41      17.08  21.52
MNISTsmall   3.03     3.94   3.46       3.70   2.61
MNISTrot     11.11    14.69  10.30      11.75  11.25
MNISTImg     22.61    16.15  23.00      22.07  22.77
MNISTrand    14.58    9.80   11.28      10.48  9.70
MNISTrotIm   55.18    52.21  51.93      55.16  47.14

Table 3. Error rates on the standard MNIST dataset using sigmoid, linear and modulus activation functions, compared with other models without spatial knowledge.

DRBM   SVM    NNet   AES (sigm.)  AES (lin.)  AES (modulus)
1.81   1.40   1.93   1.27         1.99        1.76

4. Conclusion

We showed how we may assign unnormalized confidence scores to autoencoder networks by interpreting them as dynamical systems. Unlike previous approaches to computing scores, the dynamical systems perspective allows us to compute scores for various transfer functions and independently of the training criterion. We also showed how multiple class-specific autoencoders can be turned into a generative classifier that yields competitive performance in difficult benchmark tasks.

While a class-independent processing hierarchy is likely to be a good model for early processing in many tasks, class-specific dynamical systems may offer an appealing view of higher level processing. Under this view, a class is represented by a dynamic sub-network, not just a classifier weight. Such a sub-network makes it particularly easy to model complex invariances, since it uses a lot of resources to encode within-class variability.

Acknowledgements

We thank the members of the LISA Lab at Montreal for helpful discussions. This work was supported in part by the German Federal Ministry of Education and Research (BMBF) in the project 01GQ0841 (BFNT Frankfurt).

References

Alain, G. and Bengio, Y. What regularized auto-encoders learn from the data generating distribution. In International Conference on Learning Representations (ICLR), 2013.

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. Theano: a CPU and GPU math expression compiler. In Python for Scientific Computing Conference (SciPy), 2010.

Cover, T. M. and Thomas, J. Elements of Information Theory. New York: John Wiley & Sons, Inc., 1991.

Duda, R. and Hart, P. Pattern Classification. John Wiley & Sons, 2001.

Gutmann, M. U. and Hyvärinen, A. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13:307–361, March 2012.

Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

Hyvärinen, A. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6:695–709, December 2005.

Krizhevsky, A., Sutskever, I., and Hinton, G. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.

Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. An empirical evaluation of deep architectures on problems with many factors of variation. In International Conference on Machine Learning (ICML), 2007.

Le, Q., Ranzato, M. A., Monga, R., Devin, M., Chen, K., Corrado, G., Dean, J., and Ng, A. Building high-level features using large scale unsupervised learning. In International Conference on Machine Learning (ICML), 2012.

Memisevic, R. Gradient-based learning of higher-order image features. In International Conference on Computer Vision (ICCV), 2011.

Memisevic, R., Zach, C., Hinton, G., and Pollefeys, M. Gated softmax classification. Advances in Neural Information Processing Systems (NIPS), 23, 2011.

Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. Contractive auto-encoders: Explicit invariance during feature extraction. In International Conference on Machine Learning (ICML), 2011.

Rolfe, J. T. and LeCun, Y. Discriminative recurrent sparse auto-encoders. In International Conference on Learning Representations (ICLR), 2013.

Seung, H. S. Learning continuous attractors in recurrent networks. Advances in Neural Information Processing Systems (NIPS), 10:654–660, 1998.

Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., and Manning, C. D. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2011.

Susskind, J., Memisevic, R., Hinton, G., and Pollefeys, M. Modeling the joint density of two images under a variety of transformations. In International Conference on Computer Vision and Pattern Recognition (CVPR), 2011.

Swersky, K., Buchman, D., Marlin, B. M., and de Freitas, N. On autoencoders and score matching for energy based models. In International Conference on Machine Learning (ICML), 2011.

Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, July 2011.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P. A. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning (ICML), 2008.

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P. A. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, 2010.

Welling, M., Rosen-Zvi, M., and Hinton, G. Exponential family harmoniums with an application to information retrieval. Advances in Neural Information Processing Systems (NIPS), 17, 2005.

Zou, W. Y., Zhu, S., Ng, A., and Yu, K. Deep learning of invariant features via simulated fixations in video. In Advances in Neural Information Processing Systems (NIPS), 2012.