On Autoencoders and Score Matching for Energy Based Models

Kevin Swersky*, Marc’Aurelio Ranzato†, David Buchman*, Benjamin M. Marlin*, Nando de Freitas*
*Department of Computer Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
†Department of Computer Science, University of Toronto, Toronto, ON M5S 2G4, Canada

Abstract

We consider estimation methods for the class of continuous-data energy based models (EBMs). Our main result shows that estimating the parameters of an EBM using score matching when the conditional distribution over the visible units is Gaussian corresponds to training a particular form of regularized autoencoder. We show how different Gaussian EBMs lead to different autoencoder architectures, providing deep links between these two families of models. We compare the score matching estimator for the mPoT model, a particular Gaussian EBM, to several other training methods on a variety of tasks including image denoising and unsupervised feature extraction. We show that the regularization function induced by score matching leads to superior classification performance relative to a standard autoencoder. We also show that score matching yields classification results that are indistinguishable from better-known stochastic approximation maximum likelihood estimators.

1. Introduction

In this work, we consider a rich class of probabilistic models called energy based models (EBMs) (LeCun et al., 2006; Teh et al., 2003; Hinton, 2002). These models define a probability distribution through an exponentiated energy function. Markov Random Fields (MRFs) and Restricted Boltzmann Machines (RBMs) are the most common instances of such models and have a long history in particular application areas, including modeling natural images. Recently, more sophisticated latent variable EBMs for continuous data, including the PoT (Welling et al., 2003), mPoT (Ranzato et al., 2010b), mcRBM (Ranzato & Hinton, 2010), FoE (Schmidt et al., 2010) and others, have become popular models for learning representations of natural images as well as other sources of real-valued data. Such models, also called gated MRFs, leverage latent variables to represent higher order interactions between the input variables. In the very active research area of deep learning (Hinton et al., 2006), these models have been employed as elementary building blocks to construct hierarchical models that achieve very promising performance on several perceptual tasks (Ranzato & Hinton, 2010; Bengio, 2009).

Maximum likelihood estimation is the default parameter estimation approach for probabilistic models due to its optimal theoretical properties. Unfortunately, maximum likelihood estimation is computationally infeasible in many EBMs due to the presence of an intractable normalization term (the partition function) in the model probability. This term arises in EBMs because the exponentiated energies do not automatically integrate to unity, unlike directed models parameterized by products of locally normalized conditional distributions (Bayesian networks). Several alternative methods have been proposed to estimate the parameters of an EBM without the need for computing the partition function. One particularly interesting method is called score matching (SM) (Hyvärinen, 2005). The score matching objective function is constructed from an L2 loss on the difference between the derivatives of the log of the model and empirical distribution functions with respect to the inputs. Hyvärinen (2005) showed that this results in a cancellation of the partition function. Further manipulation yields an estimator that can be computed analytically and is provably consistent.

Autoencoder neural networks are another class of models that are often used to model high-dimensional real-valued data (Hinton & Zemel, 1994; Vincent et al., 2008; Vincent, 2011; Kingma & LeCun, 2010). Both EBMs and autoencoders are unsupervised models that can be thought of as learning to re-represent input data in a latent space. In contrast to probabilistic EBMs, autoencoders are deterministic and feed-forward. As a result, autoencoders can be trained to reconstruct their input through one or more hidden layers, they have fast feed-forward inference for hidden layer states, and all common training losses lead to computationally tractable model estimation methods. In order to learn better representations, autoencoders are often modified by tying the weights between the input and output layers to reduce the number of parameters, including additional terms in the objective to bias learning toward sparse hidden unit activations, and adding noise to input data to increase robustness (Vincent et al., 2008; Vincent, 2011). Interestingly, Vincent (2011) showed that a particular kind of denoising autoencoder trained to minimize an L2 reconstruction error can be interpreted as a Gaussian RBM trained using Hyvärinen's score matching estimator.

In this paper, we apply score matching to a number of latent variable EBMs where the conditional distribution of the visible units given the hidden units is Gaussian. We show that the resulting estimation algorithms can be interpreted as minimizing a regularized L2 reconstruction error on the visible units. For Gaussian-binary RBMs, the reconstruction term corresponds to a standard autoencoder with tied weights. For the mPoT and mcRBM models, the reconstruction terms correspond to new autoencoder architectures that take into account the covariance structure of the inputs. This suggests a new way to derive novel autoencoder training criteria by applying score matching to the free energy of an EBM. We further generalize score matching to arbitrary EBMs with real-valued input units and show that this view leads to an intuitive interpretation for the regularization terms that appear in the score matching objective function.

2. Score Matching for Latent Energy Based Models

A latent variable energy based model defines a probability distribution over real valued data vectors v ∈ V ⊆ R^{n_v} as follows:

    P(v, h; \theta) = \frac{\exp(-E_\theta(v, h))}{Z(\theta)},    (1)

where h ∈ H ⊆ R^{n_h} are the latent variables, E_θ(v, h) is an energy function parameterized by θ ∈ Θ, and Z(θ) is the partition function. We refer to these models as latent energy based models. This general latent energy based model subsumes many specific models for real-valued data such as Boltzmann machines, exponential-family harmoniums (Welling et al., 2005), factored RBMs and Product of Student's T (PoT) models (Memisevic & Hinton, 2009; Ranzato & Hinton, 2010; Ranzato et al., 2010a;b).

The marginal distribution in terms of the free energy F_θ(v) is obtained by integrating out the hidden variables as seen below. Typically, but not always, this marginalization can be carried out analytically.

    P(v; \theta) = \frac{\exp(-F_\theta(v))}{Z(\theta)}.    (2)

Maximum likelihood parameter estimation is difficult when Z(θ) is intractable. In EBMs the intractability of Z(θ) arises due to the fact that it is a very high-dimensional integral that often lacks a closed form solution. In such cases, stochastic algorithms can be applied to approximately maximize the likelihood, and a variety of algorithms have been described and evaluated in the literature (Swersky et al., 2010; Marlin et al., 2010), including contrastive divergence (CD) (Hinton, 2002), persistent contrastive divergence (PCD) (Younes, 1989; Tieleman, 2008), and fast persistent contrastive divergence (FPCD) (Tieleman & Hinton, 2009). However, these methods often require very careful hand-tuning of optimization-related parameters like step size, momentum, batch size and weight decay, which is complicated by the fact that the objective function cannot be computed.

The score matching estimator was proposed by Hyvärinen (2005) to overcome the intractability of Z(θ) when dealing with continuous data. The score matching objective function is defined through a score function applied to the empirical p̃(v) and model p_θ(v) distributions. The score function for a generic distribution p(v) is given by ψ_i(p(v)) = ∂ log p(v)/∂v_i; for the model, ψ_i(p_θ(v)) = −∂F_θ(v)/∂v_i = −∫_h p_θ(h|v) ∂E_θ(v,h)/∂v_i dh. The full objective function is given below.

    J(\theta) = E_{\tilde{p}(v)}\left[\sum_{i=1}^{n_v} \frac{1}{2}\big(\psi_i(\tilde{p}(v)) - \psi_i(p_\theta(v))\big)^2\right].    (3)
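As a small illustration of Equations 1 and 2 (our own toy code, not from the paper), the free energy of a model with a handful of binary latent variables can be computed by explicit marginalization, F_θ(v) = −log Σ_h exp(−E_θ(v, h)); for the Gaussian-binary RBM of Example 1 below, this brute-force sum matches the analytic softplus form. All names and dimensions in the sketch are illustrative assumptions.

```python
import numpy as np
from itertools import product

def energy(v, h, W, b, c):
    """Gaussian-binary RBM energy with sigma = 1 (see Eq. 6 below)."""
    return -v @ W @ h - b @ h + 0.5 * np.sum((c - v) ** 2)

def free_energy_bruteforce(v, W, b, c):
    """F(v) = -log sum_h exp(-E(v, h)), marginalizing h explicitly."""
    nh = W.shape[1]
    configs = [np.array(h, dtype=float) for h in product([0, 1], repeat=nh)]
    return -np.log(sum(np.exp(-energy(v, h, W, b, c)) for h in configs))

def free_energy_analytic(v, W, b, c):
    """Closed form: 0.5 ||c - v||^2 - sum_j softplus(v^T W_j + b_j)."""
    return 0.5 * np.sum((c - v) ** 2) - np.sum(np.logaddexp(0.0, v @ W + b))

rng = np.random.RandomState(0)
nv, nh = 5, 3
W, b, c, v = rng.randn(nv, nh), rng.randn(nh), rng.randn(nv), rng.randn(nv)
print(np.allclose(free_energy_bruteforce(v, W, b, c),
                  free_energy_analytic(v, W, b, c)))   # True
```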

The benefit of optimizing J(θ) is that Z(θ) cancels off in the derivative of log p_θ(v), since it is constant with respect to each v_i. However, in the above form, J(θ) is still intractable due to the dependence on p̃(v). Hyvärinen (2005) shows that under weak regularity conditions J(θ) can be expressed in the following form, which can be tractably approximated by replacing the expectation over the empirical distribution by an empirical average over the training set:

    J(\theta) = E_{\tilde{p}(v)}\left[\sum_{i=1}^{n_v} \frac{1}{2}\psi_i(p_\theta(v))^2 + \frac{\partial \psi_i(p_\theta(v))}{\partial v_i}\right].    (4)

In theoretical situations where the regularity conditions on the derivatives of the empirical distribution are not satisfied, or in practical situations where a finite sample approximation to the expectation over the empirical distribution is used, a smoothed version of the score matching estimator may be of interest. Consider smoothing p̃(v) using a probabilistic kernel q_β(v|v′) with bandwidth parameter β > 0. We obtain a new distribution q_β(v) = ∫ q_β(v|v′) p̃(v′) dv′. Vincent (2011) showed that applying score matching to q_β(v) is equivalent to the following objective function, where q_β(v, v′) = q_β(v|v′) p̃(v′):

    Q(\theta) = E_{q_\beta(v,v')}\left[\sum_{i=1}^{n_v} \frac{1}{2}\big(\psi_i(q_\beta(v|v')) - \psi_i(p_\theta(v))\big)^2\right].    (5)

For the case where q_β(v|v′) = N(v|v′, β²), i.e. a Gaussian smoothing kernel with variance β², this is equivalent to the regularized score matching objective proposed in (Kingma & LeCun, 2010). We refer to the objective given by Equation 5 as denoising score matching (SMD). Although SMD is intractable to evaluate analytically, we can again replace the integral over v′ by an empirical average over a finite sample of training data. We can then replace the integral over v by an empirical average over samples v, which can be easily drawn from q_β(v|v′) for each training sample v′.

Compared to PCD and CD, SM and SMD give tractable objective functions that can be used to monitor training progress. While SMD is not consistent, it does have significant computational advantages relative to SM (Vincent, 2011).

Example 1 (Score Matching for Gaussian-binary RBMs). Here, the energy E_θ(v, h) is given by:

    E_\theta(v, h) = -\sum_{i=1}^{n_v}\sum_{j=1}^{n_h} \frac{v_i}{\sigma_i} W_{ij} h_j - \sum_{j=1}^{n_h} b_j h_j + \frac{1}{2}\sum_{i=1}^{n_v} \frac{(c_i - v_i)^2}{\sigma_i^2},    (6)

where the parameters are θ = (W, σ, b, c) and h_j ∈ {0, 1}. This leads to the free energy F_θ(v):

    F_\theta(v) = \frac{1}{2}\sum_{i=1}^{n_v} \frac{(c_i - v_i)^2}{\sigma_i^2} - \sum_{j=1}^{n_h} \log\left(1 + \exp\left(\sum_{i=1}^{n_v} \frac{v_i}{\sigma_i} W_{ij} + b_j\right)\right).    (7)

The corresponding score matching objective is:

    J(\theta) = \frac{1}{N}\sum_{n=1}^{N}\sum_{i=1}^{n_v}\left[\frac{1}{2}\left(\frac{v_{in} - c_i}{\sigma_i^2} - \frac{1}{\sigma_i}\sum_{j=1}^{n_h} W_{ij}\hat{h}_{jn}\right)^2 - \frac{1}{\sigma_i^2} + \sum_{j=1}^{n_h} \frac{W_{ij}^2}{\sigma_i^2}\hat{h}_{jn}\big(1 - \hat{h}_{jn}\big)\right],    (8)

where \hat{h}_{jn} := \mathrm{sigm}\left(\sum_{i=1}^{n_v} \frac{v_{in}}{\sigma_i} W_{ij} + b_j\right) and \mathrm{sigm}(x) := \frac{1}{1 + \exp(-x)}.

For a standardized Normal model, with c = 0 and σ = 1, this objective reduces to:

    J(\theta) = \frac{1}{N}\sum_{n=1}^{N}\sum_{i=1}^{n_v}\left[\frac{1}{2}\left(v_{in} - \sum_{j=1}^{n_h} W_{ij}\hat{h}_{jn}\right)^2 - 1 + \sum_{j=1}^{n_h} W_{ij}^2 \hat{h}_{jn}\big(1 - \hat{h}_{jn}\big)\right].    (9)

The first term corresponds to the quadratic reconstruction error of an autoencoder with tied weights. From this we can see that this type of autoencoder, which researchers have previously treated as a different model, can in fact be explained by the application of the score matching estimation principle to Gaussian RBMs.
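To make Equations 9 and 5 concrete, the following is a minimal NumPy sketch of both objectives for the standardized Gaussian-binary RBM (c = 0, σ = 1). The function and variable names, toy dimensions, and noise level are our own illustrative choices rather than the paper's implementation (which trains the mPoT model with Theano); in practice these objectives would be differentiated with respect to W and b and minimized by stochastic gradient descent.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def model_score(V, W, b):
    """psi(p_theta(v)) = -dF/dv for the standardized Gaussian-binary RBM:
    the tied-weight reconstruction W sigm(W^T v + b) minus v."""
    return sigm(V @ W + b) @ W.T - V

def sm_objective(V, W, b):
    """Score matching objective of Eq. 9: quadratic reconstruction error
    plus the curvature regularizer, averaged over the N rows of V."""
    H = sigm(V @ W + b)                                    # hidden activations h_hat
    recon_err = 0.5 * np.sum((V - H @ W.T) ** 2, axis=1)   # 1/2 (v - W h_hat)^2
    curvature = np.sum((H * (1 - H)) @ (W ** 2).T - 1.0, axis=1)
    return np.mean(recon_err + curvature)

def smd_objective(Vclean, W, b, beta=0.1, rng=None):
    """One-sample Monte Carlo estimate of the denoising objective of Eq. 5
    with the Gaussian kernel q_beta(v | v') = N(v | v', beta^2 I)."""
    rng = rng or np.random.RandomState(0)
    Vnoisy = Vclean + beta * rng.randn(*Vclean.shape)      # v ~ q_beta(v | v')
    kernel_score = -(Vnoisy - Vclean) / beta ** 2          # psi_i(q_beta(v | v'))
    diff = kernel_score - model_score(Vnoisy, W, b)
    return 0.5 * np.mean(np.sum(diff ** 2, axis=1))

# Toy usage on random "data" (illustrative only).
rng = np.random.RandomState(0)
V = rng.randn(100, 16)
W, b = 0.01 * rng.randn(16, 8), np.zeros(8)
print(sm_objective(V, W, b), smd_objective(V, W, b))
```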

3. Applying and Interpreting Score Matching For Latent EBMs

We now derive score matching objectives for several commonly used EBMs. In order to apply score matching to a particular EBM, one simply needs an expression for the corresponding free energy.

Example 2 (Score matching for mcRBM). The energy E_θ(v, h^m, h^c) of the mcRBM model for each data point includes mean Bernoulli hidden units h^m_j ∈ {0, 1} and covariance Bernoulli hidden units h^c_k ∈ {0, 1}. The latter allow one to model correlations in the data v (Ranzato & Hinton, 2010; Ranzato et al., 2010a). To ease the notation, we will ignore the index n over the data. The energy for this model is:

    E_\theta(v, h^m, h^c) = -\frac{1}{2}\sum_{f=1}^{n_f}\sum_{k=1}^{n_{hc}} P_{fk}\, h^c_k \Big(\sum_{i=1}^{n_v} C_{if} v_i\Big)^2 - \sum_{j=1}^{n_{hm}}\sum_{i=1}^{n_v} W_{ij} h^m_j v_i - \sum_{j=1}^{n_{hm}} b^m_j h^m_j - \sum_{k=1}^{n_{hc}} b^c_k h^c_k - \sum_{i=1}^{n_v} b^v_i v_i + \frac{1}{2}\sum_{i=1}^{n_v} v_i^2,    (10)

where θ = (b^v, b^m, b^c, P, W, C). This leads to the free energy F_θ(v):

    F_\theta(v) = -\sum_{k=1}^{n_{hc}} \log\big(1 + e^{\phi^c_k}\big) - \sum_{j=1}^{n_{hm}} \log\big(1 + e^{\phi^m_j}\big) - \sum_{i=1}^{n_v} b^v_i v_i + \frac{1}{2}\sum_{i=1}^{n_v} v_i^2,    (11)

where \phi^c_k = \frac{1}{2}\sum_{f=1}^{n_f} P_{fk}\big(\sum_{i=1}^{n_v} C_{if} v_i\big)^2 + b^c_k and \phi^m_j = \sum_{i=1}^{n_v} W_{ij} v_i + b^m_j.

The corresponding score matching objective is:

    J(\theta) = \sum_{i=1}^{n_v}\Bigg[\frac{1}{2}\psi_i(p_\theta(v))^2 + \sum_{k=1}^{n_{hc}}\Big(\rho(\hat{h}^c_k)\, D_{ik}^2 + \hat{h}^c_k K_{ik}\Big) + \sum_{j=1}^{n_{hm}} \hat{h}^m_j\big(1 - \hat{h}^m_j\big) W_{ij}^2 - 1\Bigg],    (12)

where

    \psi_i(p_\theta(v)) = \sum_{k=1}^{n_{hc}} \hat{h}^c_k D_{ik} + \sum_{j=1}^{n_{hm}} \hat{h}^m_j W_{ij} + b^v_i - v_i,
    K_{ik} = \sum_{f=1}^{n_f} P_{fk} C_{if}^2,
    D_{ik} = \sum_{f=1}^{n_f} P_{fk}\Big(\sum_{i'=1}^{n_v} C_{i'f} v_{i'}\Big) C_{if},
    \hat{h}^c_k = \mathrm{sigm}(\phi^c_k),
    \hat{h}^m_j = \mathrm{sigm}(\phi^m_j),
    \rho(x) := x(1 - x).

Example 3 (Score matching for mPoT). The energy E_θ(v, h^m, h^c) of the mPoT model is:

    E_\theta(v, h^m, h^c) = \sum_{k=1}^{n_{hc}}\Bigg[h^c_k\Big(1 + \frac{1}{2}\big(\sum_{i=1}^{n_v} C_{ik} v_i\big)^2\Big) + (1 - \gamma)\log(h^c_k)\Bigg] + \frac{1}{2}\sum_{i=1}^{n_v} v_i^2 - \sum_{i=1}^{n_v} b^v_i v_i - \sum_{j=1}^{n_{hm}}\sum_{i=1}^{n_v} h^m_j W_{ij} v_i - \sum_{j=1}^{n_{hm}} b^m_j h^m_j,    (13)

where θ = (γ, W, C, b^v, b^m), h^c is a vector of Gamma covariance latent variables, C is a filter bank and γ is a scalar parameter. This leads to the free energy F_θ(v):

    F_\theta(v) = \sum_{k=1}^{n_{hc}} \gamma\log\Big(1 + \frac{1}{2}(\phi^c_k)^2\Big) - \sum_{j=1}^{n_{hm}} \log\big(1 + e^{\phi^m_j}\big) - \sum_{i=1}^{n_v} b^v_i v_i + \frac{1}{2}\sum_{i=1}^{n_v} v_i^2,    (14)

where \phi^c_k = \sum_{i=1}^{n_v} C_{ik} v_i and \phi^m_j = \sum_{i=1}^{n_v} W_{ij} v_i + b^m_j.
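The mPoT expressions above are easy to turn into code. The sketch below is our own illustrative NumPy code with arbitrary toy dimensions, not the released mPoT implementation: it evaluates the free energy of Equation 14 and the corresponding model score ψ(p_θ(v)) = −∂F_θ(v)/∂v, built from the conditional means ĥ^c_k = γ/(1 + ½(φ^c_k)²) and ĥ^m_j = sigm(φ^m_j) that reappear in the redefinitions that follow. A finite-difference check confirms the gradient.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def mpot_free_energy(v, C, W, bm, bv, gamma):
    """mPoT free energy, Eq. 14, with phi_c = C^T v and phi_m = W^T v + b_m."""
    phi_c = C.T @ v
    phi_m = W.T @ v + bm
    return (gamma * np.sum(np.log1p(0.5 * phi_c ** 2))
            - np.sum(np.log1p(np.exp(phi_m)))
            - bv @ v + 0.5 * v @ v)

def mpot_score(v, C, W, bm, bv, gamma):
    """Model score psi(p_theta(v)) = -dF/dv, expressed via the conditional
    means h_c = gamma / (1 + 0.5 phi_c^2) and h_m = sigm(phi_m)."""
    phi_c = C.T @ v
    phi_m = W.T @ v + bm
    h_c = gamma / (1.0 + 0.5 * phi_c ** 2)
    h_m = sigm(phi_m)
    return -(C @ (h_c * phi_c)) + W @ h_m + bv - v

# Finite-difference check that psi_i = -dF/dv_i (illustrative sizes only).
rng = np.random.RandomState(0)
nv, nhc, nhm = 6, 5, 4
C, W = rng.randn(nv, nhc), rng.randn(nv, nhm)
bm, bv, gamma, v = rng.randn(nhm), rng.randn(nv), 1.5, rng.randn(nv)
eps = 1e-5
num_grad = np.array([(mpot_free_energy(v + eps * e, C, W, bm, bv, gamma)
                      - mpot_free_energy(v - eps * e, C, W, bm, bv, gamma)) / (2 * eps)
                     for e in np.eye(nv)])
print(np.allclose(mpot_score(v, C, W, bm, bv, gamma), -num_grad))  # True
```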

The corresponding score matching objective J(θ) is equivalent to the objective given in Equation 12 with the following redefinition of terms:

    P = -I_{n_{hc}},
    \hat{h}^c_k = \gamma\,\varphi(\phi^c_k),    (15)
    \hat{h}^m_j = \mathrm{sigm}(\phi^m_j),    (16)
    \varphi(x) := \frac{1}{1 + \frac{1}{2}x^2},
    \rho(x) := x^2,

where I_{n_{hc}} is the n_{hc} × n_{hc} identity matrix.

In each of these examples, we see that an objective emerges which seeks to minimize a form of regularized reconstruction error, and that the forms of these regularizers can end up being quite different. Rather than trying to interpret score matching on a case by case basis, we provide a general theorem for all latent EBMs to which score matching can be applied:

Theorem 1. The score matching objective, Equation (4), for a latent energy based model can be expressed succinctly in terms of either the free energy or expectations of the energy with respect to the conditional distribution p(h|v). Specifically,

    J(\theta) = E_{\tilde{p}(v)}\Bigg[\sum_{i=1}^{n_v} \frac{1}{2}\psi_i(p_\theta(v))^2 + \frac{\partial\psi_i(p_\theta(v))}{\partial v_i}\Bigg]
              = E_{\tilde{p}(v)}\Bigg[\sum_{i=1}^{n_v} \frac{1}{2}\Big(E_{p_\theta(h|v)}\Big[\frac{\partial E_\theta(v,h)}{\partial v_i}\Big]\Big)^2 + \mathrm{var}_{p_\theta(h|v)}\Big(\frac{\partial E_\theta(v,h)}{\partial v_i}\Big) - E_{p_\theta(h|v)}\Big[\frac{\partial^2 E_\theta(v,h)}{\partial v_i^2}\Big]\Bigg].
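Theorem 1 can be checked numerically on a small model. The sketch below is our own illustrative code: it uses the standardized Gaussian-binary RBM of Example 1, for which p_θ(h|v) can be enumerated exactly, and compares the free-energy form ½ψ_i² + ∂ψ_i/∂v_i against the energy-expectation form of the theorem, term by term.

```python
import numpy as np
from itertools import product

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.RandomState(0)
nv, nh = 4, 3
W, b, v = rng.randn(nv, nh), rng.randn(nh), rng.randn(nv)

# Free-energy form: psi_i = -dF/dv_i and dpsi_i/dv_i, computed analytically
# for the standardized Gaussian-binary RBM (c = 0, sigma = 1).
h_mean = sigm(v @ W + b)
psi = W @ h_mean - v
dpsi_dv = (W ** 2) @ (h_mean * (1 - h_mean)) - 1.0
lhs = 0.5 * psi ** 2 + dpsi_dv

# Energy-expectation form: enumerate h in {0,1}^nh to get p(h|v) exactly,
# then compute E[dE/dv_i], var(dE/dv_i), and E[d^2E/dv_i^2].
configs = np.array(list(product([0, 1], repeat=nh)), dtype=float)
energies = -(v @ W) @ configs.T - configs @ b + 0.5 * np.sum(v ** 2)
p_h_given_v = np.exp(-energies)
p_h_given_v /= p_h_given_v.sum()
dE_dv = v[:, None] - W @ configs.T            # shape (nv, 2^nh): dE/dv_i for each h
mean_dE = dE_dv @ p_h_given_v
var_dE = (dE_dv - mean_dE[:, None]) ** 2 @ p_h_given_v
d2E_dv2 = np.ones(nv)                          # second derivative is 1 for every h
rhs = 0.5 * mean_dE ** 2 + var_dE - d2E_dv2

print(np.allclose(lhs, rhs))                   # True, as Theorem 1 predicts
```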

Corollary 1. If the energy function of a latent EBM E_θ(v, h) takes the following form:

    E_\theta(v, h) = \frac{1}{2}(v - \mu(h))^T \Omega(h)\,(v - \mu(h)) + g(h),

where µ(h) is an arbitrary vector-valued function of length n_v, g(h) is an arbitrary scalar function, and Ω(h) is an n_v × n_v positive-definite matrix-valued function, then the vector-valued score function ψ(p_θ(v)) will be −E_{p_θ(h|v)}[Ω(h)(v − µ(h))]. As a result, the score matching objective can be expressed as:

    J(\theta) = E_{\tilde{p}(v)}\Bigg[\sum_{i=1}^{n_v} \frac{1}{2}\Big(E_{p_\theta(h|v)}\big[\Omega(h)(v - \mu(h))\big]_i\Big)^2 + \mathrm{var}_{p_\theta(h|v)}\Big(\big[\Omega(h)(v - \mu(h))\big]_i\Big) - E_{p_\theta(h|v)}\big[\Omega(h)_{ii}\big]\Bigg].

The proofs of Theorem 1 and Corollary 1 are straightforward, and can be found in an online appendix to this paper (http://www.cs.ubc.ca/~nando/papers/smpaper-appendix.pdf). Corollary 1 states that score matching applied to a Gaussian latent EBM will always result in a quadratic reconstruction term with penalties to minimize the variance of the reconstruction and to maximize the expected curvature of the energy with respect to v. This shows that we can develop new autoencoder architectures in a principled way by simply starting with an EBM and applying score matching.

One further connection between the two models is that one step of gradient descent on the free energy F_θ(v) of an EBM corresponds to one feed-forward step of an autoencoder. To see this, consider the mPoT model. If we start at some visible configuration v and update a single dimension i:

    v_i^{(t+1)} = v_i^{(t)} - \eta\frac{\partial F_\theta(v)}{\partial v_i}
                = v_i^{(t)} + \eta\Bigg(\sum_{k=1}^{n_{hc}} \hat{h}^c_k D_{ik} + \sum_{j=1}^{n_{hm}} \hat{h}^m_j W_{ij} + b^v_i - v_i^{(t)}\Bigg).

Then setting η = 1, the v_i^{(t)} terms cancel and we get:

    v_i^{(t+1)} = \sum_{k=1}^{n_{hc}} \hat{h}^c_k D_{ik} + \sum_{j=1}^{n_{hm}} \hat{h}^m_j W_{ij} + b^v_i.    (17)

This corresponds to the reconstruction produced by mPoT in its score matching objective. In general, an autoencoder reconstruction can be produced by taking a single step of gradient descent along the free energy of its corresponding EBM.
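This single-gradient-step identity is also easy to verify in code. The sketch below is an illustration using the standardized Gaussian-binary RBM free energy rather than mPoT (whose gradient has the same structure but more terms); with η = 1 the gradient step reproduces the tied-weight autoencoder reconstruction exactly. Names and sizes are our own illustrative choices.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def free_energy_grad(v, W, b):
    """dF/dv for the standardized Gaussian-binary RBM: v - W sigm(W^T v + b)."""
    return v - W @ sigm(v @ W + b)

def autoencoder_reconstruction(v, W, b):
    """Tied-weight autoencoder forward pass: decode(encode(v))."""
    return W @ sigm(v @ W + b)

rng = np.random.RandomState(0)
W, b, v = rng.randn(8, 4), rng.randn(4), rng.randn(8)

one_gradient_step = v - 1.0 * free_energy_grad(v, W, b)   # eta = 1
print(np.allclose(one_gradient_step,
                  autoencoder_reconstruction(v, W, b)))     # True (Eq. 17)
```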

4. Experiments

In this section, we study several estimation methods applied to the mPoT model, including SM, SMD, CD, PCD, and FPCD, with the goal of uncovering differences in the characteristics of trained models due to variations in training methods. For our experiments, we used two datasets of images.

The first dataset consists of 128,000 color image patches of size 16x16 pixels randomly extracted from the Berkeley segmentation dataset (http://www.cs.berkeley.edu/projects/vision/grouping/segbench/). We subtracted the per-patch means and applied PCA whitening. We retained 99% of the variance, corresponding to 105 eigenvectors. All estimation methods were applied to the mPoT model by training on mini-batches of size 128 for 100 epochs of stochastic gradient descent.

The second dataset, named CIFAR 10 (Krizhevsky, 2009), consists of color images of size 32x32 pixels belonging to one of 10 categories. The task is to classify a set of 10,000 test images. CIFAR 10 is a subset of a larger dataset of tiny images (Torralba et al., 2008). Using a protocol established in previous work (Krizhevsky, 2009; Ranzato & Hinton, 2010), we built a training dataset of 8x8 color image patches from this larger dataset, ensuring there was no overlap with CIFAR 10. The preprocessing of the data is exactly the same as for the Berkeley dataset, but here we use approximately 800,000 image patches and perform only 10 epochs of training. For our experiments, we used the Theano package (http://deeplearning.net/software/theano/) and mPoT code (http://www.cs.toronto.edu/~ranzato/publications/mPoT/mPoT.html) from (Ranzato et al., 2010b).

4.1. Objective Function Analysis

From Corollary 1, we know that we can interpret score matching for mPoT as trading off reconstruction error, reconstruction variance and the expected curvature of the energy function with respect to the visible units. This experiment, using the Berkeley dataset, is designed to determine how these terms evolve over the course of training and to what degree their changes impact the final model. Figures 1(a) and 1(b) show the values of the three terms using non-noisy inputs on each training epoch, as well as the overall objective function (the sum of the three terms). Surprisingly, these results show that most of the training is involved with maximizing the expected curvature (corresponding to a lower negative curvature). In SM, each point is relatively isolated in v-space, meaning that the objective will try to make the distribution very peaked. In SMD, each point exists near a cloud of points and so the distribution must be broader. From this perspective, SMD can be seen as a regularized version of SM that puts less emphasis on changing the expected curvature. This also seems to give SMD some room to reduce the reconstruction error.

[Figure 1 appears here: six panels plotted against training epoch. (a) SM terms (×10^4), (b) SMD terms, and (c) Autoencoder terms, each showing the Total, Recon, Var, and Curve values; (d) free energy difference and (e), (f) denoising MSE curves, each comparing FPCD, PCD, CD, SM, and SMD.]

Figure 1. (a), (b), (c) Expected reconstruction error, reconstruction variance, and energy curvature for SM, SMD, and AE; Total represents the sum of these terms. (d) Difference of free energy between noisy and test images. (e) MSE of denoised test images using mean-field. (f) MSE of denoised test images using Bayesian MAP.

To examine the impact of regularization, we trained an autoencoder (AE) based on the mPoT model using the reconstruction given by Equation 17, which corresponds to SM without the variance and curvature terms. Figure 1(c) shows that simply optimizing the reconstruction leaves the curvature almost invariant, which agrees with the findings of (Ranzato et al., 2007).

4.2. Denoising

In our next set of experiments, we compare models learned by each of the score matching estimators with models learned by the more commonly used stochastic estimators. For these experiments, we trained mPoT models corresponding to SM, SMD, FPCD, PCD, and CD. We compare the models in terms of the average free energy difference between natural image patches and patches corrupted by Gaussian noise. We also consider denoising natural image patches. (For convenience, both tasks were performed in the PCA domain. We use a standard deviation of 1 for the Gaussian noise in all cases.)

During training, we hope that the probability of natural images will increase while that of other images decreases. The free energy difference between natural and other images is equivalent to the log of their probability ratio, so we expect the free energy difference to increase during training as well. Figure 1(d) shows the difference in free energy between a test set of 10,000 image patches from the Berkeley dataset and the same images corrupted by noise. For most estimators, the free energy difference improves as training proceeds, as expected. Interestingly, SM and SMD exhibit completely opposite behaviors. SM seems to significantly increase the free energy difference relative to nearby noisy images, corresponding to a distribution that is peaked around natural images. SMD, on the other hand, actually decreases the free energy difference relative to nearby noisy images.
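The free-energy-difference diagnostic of Figure 1(d) only requires a free energy function. Below is a minimal sketch (our own code; `free_energy` stands for any trained model's free energy, for example the mPoT free energy of Equation 14, and the noise standard deviation of 1 follows the setting described above).

```python
import numpy as np

def free_energy_gap(V_clean, free_energy, noise_std=1.0, rng=None):
    """Average free energy difference F(noisy) - F(clean) over a batch of
    patches, the quantity plotted in Figure 1(d). `free_energy` is any
    callable mapping a single patch to a scalar."""
    rng = rng or np.random.RandomState(0)
    V_noisy = V_clean + noise_std * rng.randn(*V_clean.shape)
    return np.mean([free_energy(vn) - free_energy(vc)
                    for vn, vc in zip(V_noisy, V_clean)])
```

A larger positive value means the model assigns relatively lower probability to the corrupted patches than to the clean ones.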

[Figure 2 appears here: two panels of learned filters, (a) mean filters and (b) covariance filters, for each estimation method.]

Figure 2. mPoT filters learned using different estimation methods: (a) “mean” filters, (b) “covariance” filters.

In the next experiment, we consider an image denoising task. We take an image patch v and add Gaussian white noise, obtaining a noisy patch v′. We then apply each model to denoise each patch v′, obtaining a reconstruction v̂. The first denoising method, shown in Figure 1(e), computes a reconstruction v̂ by simulating one step of a Markov chain using a mean-field approximation. That is, we first compute h^c_k and h^m_j by Equations 15 and 16 using v′ as the input. The reconstruction is the expectation of the conditional distribution P_θ(v|h^c_k, h^m_j). The second method, shown in Figure 1(f), is the Bayesian MAP estimator:

    \hat{v} = \arg\min_v F_\theta(v) + \frac{\lambda}{2}\|v - v'\|^2,    (18)

where λ is a scalar representing how close the reconstruction should remain to the noisy input. We select λ by cross-validation. The results show that score matching achieves the minimum error using both denoising approaches; however, it quickly overfits as training proceeds. FPCD and PCD do not match the minimum error of SM and also overfit, albeit to a lesser extent. CD and SMD do not appear to overfit. However, we note that the minimum error obtained by SMD is significantly higher than the minimum error obtained by SM using both denoising methods. This is quite intuitive, since SMD is equivalent to estimating the model using a smoothed training distribution that shifts mass onto nearby noisy images.
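As a concrete illustration of the MAP denoiser in Equation 18, the sketch below minimizes F_θ(v) + (λ/2)‖v − v′‖² by plain gradient descent. The Gaussian-binary RBM free energy gradient, step size, iteration count and λ are our own stand-ins; the experiments above use the trained mPoT free energy and select λ by cross-validation.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def denoise_map(v_noisy, free_energy_grad, lam=1.0, n_steps=500, lr=0.01):
    """Gradient-descent solver for the MAP denoiser of Eq. 18:
    v_hat = argmin_v F_theta(v) + (lam / 2) * ||v - v_noisy||^2."""
    v = v_noisy.copy()
    for _ in range(n_steps):
        v -= lr * (free_energy_grad(v) + lam * (v - v_noisy))
    return v

# Illustrative usage with the standardized Gaussian-binary RBM free energy,
# whose gradient is dF/dv = v - W sigm(W^T v + b).
rng = np.random.RandomState(0)
W, b = 0.1 * rng.randn(16, 8), np.zeros(8)
grad_F = lambda v: v - W @ sigm(v @ W + b)
v_noisy = rng.randn(16)
v_hat = denoise_map(v_noisy, grad_F, lam=2.0)
```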
4.3. Feature Extraction and Classification

One of the primary uses for latent EBMs is to generate discriminative features. Table 1 shows the result of using each method to extract features on the benchmark CIFAR 10 dataset. We follow the protocol of (Ranzato & Hinton, 2010) with early stopping. We use a validation set to select regularization parameters. With the exception of AE, all methods appear to do well and the differences between them are not statistically significant. AE, on the other hand, does significantly worse.

Table 1. Recognition accuracy on CIFAR 10.

    CD       PCD      FPCD     SM       SMD      AE
    64.6%    64.7%    65.5%    65.0%    64.7%    57.6%

Finally, we show examples of filters learned by each method. Figure 2(a) shows a random subset of “mean” filters corresponding to the columns of W, while Figure 2(b) shows a random subset of “covariance” filters corresponding to the columns of C. Interestingly, only FPCD and PCD show structure in the learned mean filters. In the covariance units, all methods except AE learn localized Gabor-like filters. It is well known that obtaining nice looking filters usually correlates with good performance, but it is not always clear what leads to these filters.

We have shown here that one way to obtain good qualitative and quantitative performance is to focus on appropriately modeling the curvature of the energy with respect to v. In this context, the SM reconstruction and variance terms serve to ensure that the peaks of the distribution occur around the training cases.

5. Conclusion

By applying score matching to the energy space of a latent EBM, as opposed to the free energy space, we gain an intuitive interpretation of the score matching objective. We can always break the objective down into three terms corresponding to expectations under the conditional distribution of the hidden units: reconstruction, reconstruction variance, and curvature. We have determined that for the Gaussian-binary RBM, the reconstruction term will always correspond to an autoencoder with tied weights. While autoencoders and RBMs were previously considered to be related but separate models, this analysis shows that they can be interpreted as different estimators applied to the same underlying model. We also showed that one can derive novel autoencoders by applying score matching to more complex EBMs. This allows us to think about models in terms of EBMs before creating a corresponding autoencoder to leverage fast inference. Furthermore, this framework provides guidance on selecting principled regularization functions for autoencoder training, leading to improved representations.

Our experiments show not only that score matching yields similar performance to existing estimation methods when applied to classification, but also that shaping the curvature of the energy appropriately may be important for generating good features. While this seems obvious for probabilistic EBMs, it has previously been difficult to apply to autoencoders because they were not thought of as having a corresponding energy function. Now that we know which statistics may be important to monitor during training, it would be interesting to see what happens when other heuristics, such as sparsity, are applied to help generate interpretable features.

References

Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.

Hinton, G.E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.

Hinton, G.E. and Zemel, R.S. Autoencoders, minimum description length and Helmholtz free energy. In Advances in Neural Information Processing Systems, pp. 3–10, 1994.

Hinton, G.E., Osindero, S., and Teh, Y.W. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

Hyvärinen, A. Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research, 6:695–709, 2005.

Kingma, D. and LeCun, Y. Regularized estimation of image statistics by score matching. In Advances in Neural Information Processing Systems, 2010.

Krizhevsky, A. Learning multiple layers of features from tiny images. MSc Thesis, Dept. of Comp. Science, Univ. of Toronto, 2009.

LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., and Huang, F.J. A tutorial on energy-based learning. In Predicting Structured Data. MIT Press, 2006.

Marlin, B.M., Swersky, K., Chen, B., and de Freitas, N. Inductive principles for restricted Boltzmann machine learning. In Artificial Intelligence and Statistics, pp. 509–516, 2010.

Memisevic, R. and Hinton, G.E. Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 22:1473–1492, 2009.

Ranzato, M. and Hinton, G.E. Modeling pixel means and covariances using factorized third-order Boltzmann machines. In IEEE Computer Vision and Pattern Recognition, pp. 2551–2558, 2010.

Ranzato, M., Boureau, Y.L., Chopra, S., and LeCun, Y. A unified energy-based framework for unsupervised learning. In Artificial Intelligence and Statistics, 2007.

Ranzato, M., Krizhevsky, A., and Hinton, G.E. Factored 3-way restricted Boltzmann machines for modeling natural images. In Artificial Intelligence and Statistics, pp. 621–628, 2010a.

Ranzato, M., Mnih, V., and Hinton, G.E. How to generate realistic images using gated MRF's. In Advances in Neural Information Processing Systems, pp. 2002–2010, 2010b.

Schmidt, U., Gao, Q., and Roth, S. A generative perspective on MRFs in low-level vision. In IEEE Computer Vision and Pattern Recognition, 2010.

Swersky, K., Chen, B., Marlin, B.M., and de Freitas, N. A tutorial on stochastic approximation algorithms for training restricted Boltzmann machines and deep belief nets. In Information Theory and Applications Workshop, pp. 1–10, 2010.

Teh, Y.W., Welling, M., Osindero, S., and Hinton, G.E. Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4:1235–1260, 2003.

Tieleman, T. Training restricted Boltzmann machines using approximations to the likelihood gradient. In International Conference on Machine Learning, pp. 1064–1071, 2008.

Tieleman, T. and Hinton, G.E. Using fast weights to improve persistent contrastive divergence. In International Conference on Machine Learning, 2009.

Torralba, A., Fergus, R., and Freeman, W.T. 80 million tiny images: A large dataset for non-parametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30:1958–1970, 2008.

Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, to appear, 2011.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.A. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning, pp. 1096–1103, 2008.

Welling, M., Hinton, G.E., and Osindero, S. Learning sparse topographic representations with products of Student-t distributions. In Advances in Neural Information Processing Systems, 2003.

Welling, M., Rosen-Zvi, M., and Hinton, G.E. Exponential family harmoniums with an application to information retrieval. In Advances in Neural Information Processing Systems, 2005.

Younes, L. Parametric inference for imperfectly observed Gibbsian fields. Probability Theory and Related Fields, 82(4):625–645, 1989.