On Autoencoders and Score Matching for Energy Based Models

Kevin Swersky*, Marc’Aurelio Ranzato†, David Buchman*, Benjamin M. Marlin*, Nando de Freitas*
*Department of Computer Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
†Department of Computer Science, University of Toronto, Toronto, ON M5S 2G4, Canada

Abstract

We consider estimation methods for the class of continuous-data energy based models (EBMs). Our main result shows that estimating the parameters of an EBM using score matching when the conditional distribution over the visible units is Gaussian corresponds to training a particular form of regularized autoencoder. We show how different Gaussian EBMs lead to different autoencoder architectures, providing deep links between these two families of models. We compare the score matching estimator for the mPoT model, a particular Gaussian EBM, to several other training methods on a variety of tasks including image denoising and unsupervised feature extraction. We show that the regularization function induced by score matching leads to superior classification performance relative to a standard autoencoder. We also show that score matching yields classification results that are indistinguishable from better-known stochastic approximation maximum likelihood estimators.

1. Introduction

In this work, we consider a rich class of probabilistic models called energy based models (EBMs) (LeCun et al., 2006; Teh et al., 2003; Hinton, 2002). These models define a probability distribution through an exponentiated energy function. Markov Random Fields (MRFs) and Restricted Boltzmann Machines (RBMs) are the most common instances of such models and have a long history in particular application areas, including modeling natural images. Recently, more sophisticated latent variable EBMs for continuous data, including the PoT (Welling et al., 2003), mPoT (Ranzato et al., 2010b), mcRBM (Ranzato & Hinton, 2010), FoE (Schmidt et al., 2010) and others, have become popular models for learning representations of natural images as well as other sources of real-valued data. Such models, also called gated MRFs, leverage latent variables to represent higher order interactions between the input variables. In the very active research area of deep learning (Hinton et al., 2006), these models have been employed as elementary building blocks to construct hierarchical models that achieve very promising performance on several perceptual tasks (Ranzato & Hinton, 2010; Bengio, 2009).

Maximum likelihood estimation is the default parameter estimation approach for probabilistic models due to its optimal theoretical properties. Unfortunately, maximum likelihood estimation is computationally infeasible in many EBMs due to the presence of an intractable normalization term (the partition function) in the model probability. This term arises in EBMs because the exponentiated energies do not automatically integrate to unity, unlike directed models parameterized by products of locally normalized conditional distributions (Bayesian networks). Several alternative methods have been proposed to estimate the parameters of an EBM without the need for computing the partition function. One particularly interesting method is called score matching (SM) (Hyvärinen, 2005). The score matching objective function is constructed from an L2 loss on the difference between the derivatives of the log of the model and empirical distribution functions with respect to the inputs. Hyvärinen (2005) showed that this results in a cancellation of the partition function. Further manipulation yields an estimator that can be computed analytically and is provably consistent.

Autoencoder neural networks are another class of models that are often used to model high-dimensional real-valued data (Hinton & Zemel, 1994; Vincent et al., 2008; Vincent, 2011; Kingma & LeCun, 2010). Both EBMs and autoencoders are unsupervised models that can be thought of as learning to re-represent input data in a latent space. In contrast to probabilistic EBMs, autoencoders are deterministic and feed-forward. As a result, autoencoders can be trained to reconstruct their input through one or more hidden layers, they have fast feed-forward inference for hidden layer states, and all common training losses lead to computationally tractable model estimation methods. In order to learn better representations, autoencoders are often modified by tying the weights between the input and output layers to reduce the number of parameters, including additional terms in the objective to bias learning toward sparse hidden unit activations, and adding noise to input data to increase robustness (Vincent et al., 2008; Vincent, 2011). Interestingly, Vincent (2011) showed that a particular kind of denoising autoencoder trained to minimize an L2 reconstruction error can be interpreted as a Gaussian RBM trained using Hyvärinen's score matching estimator.

In this paper, we apply score matching to a number of latent variable EBMs where the conditional distribution of the visible units given the hidden units is Gaussian. We show that the resulting estimation algorithms can be interpreted as minimizing a regularized L2 reconstruction error on the visible units. For Gaussian-binary RBMs, the reconstruction term corresponds to a standard autoencoder with tied weights. For the mPoT and mcRBM models, the reconstruction terms correspond to new autoencoder architectures that take into account the covariance structure of the inputs. This suggests a new way to derive novel autoencoder training criteria by applying score matching to the free energy of an EBM. We further generalize score matching to arbitrary EBMs with real-valued input units and show that this view leads to an intuitive interpretation for the regularization terms that appear in the score matching objective function.

2. Score Matching for Latent Energy Based Models

A latent variable energy based model defines a probability distribution over real valued data vectors v ∈ V ⊆ R^{n_v} as follows:

    P(v, h; \theta) = \frac{\exp(-E_\theta(v, h))}{Z(\theta)},    (1)

where h ∈ H ⊆ R^{n_h} are the latent variables, E_θ(v, h) is an energy function parameterized by θ ∈ Θ, and Z(θ) is the partition function. We refer to these models as latent energy based models. This general latent energy based model subsumes many specific models for real-valued data such as Boltzmann machines, exponential-family harmoniums (Welling et al., 2005), factored RBMs and Product of Student's T (PoT) models (Memisevic & Hinton, 2009; Ranzato & Hinton, 2010; Ranzato et al., 2010a;b).

The marginal distribution in terms of the free energy F_θ(v) is obtained by integrating out the hidden variables as seen below. Typically, but not always, this marginalization can be carried out analytically.

    P(v; \theta) = \frac{\exp(-F_\theta(v))}{Z(\theta)}.    (2)

Maximum likelihood parameter estimation is difficult when Z(θ) is intractable. In EBMs the intractability of Z(θ) arises due to the fact that it is a very high-dimensional integral that often lacks a closed form solution. In such cases, stochastic algorithms can be applied to approximately maximize the likelihood, and a variety of algorithms have been described and evaluated in the literature (Swersky et al., 2010; Marlin et al., 2010), including contrastive divergence (CD) (Hinton, 2002), persistent contrastive divergence (PCD) (Younes, 1989; Tieleman, 2008), and fast persistent contrastive divergence (FPCD) (Tieleman & Hinton, 2009). However, these methods often require very careful hand-tuning of optimization-related parameters like step size, momentum, batch size and weight decay, which is complicated by the fact that the objective function cannot be computed.

The score matching estimator was proposed by Hyvärinen (2005) to overcome the intractability of Z(θ) when dealing with continuous data. The score matching objective function is defined through a score function applied to the empirical p̃(v) and model p_θ(v) distributions. The score function for a generic distribution p(v) is given by ψ_i(p(v)) = ∂ log p(v)/∂v_i; for the model, ψ_i(p_θ(v)) = −∂F_θ(v)/∂v_i = −∫_h p_θ(h|v) ∂E_θ(v,h)/∂v_i dh. The full objective function is given below.

    J(\theta) = E_{\tilde{p}(v)}\left[\sum_{i=1}^{n_v} \frac{1}{2}\big(\psi_i(\tilde{p}(v)) - \psi_i(p_\theta(v))\big)^2\right].    (3)
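As a small illustration of Equations 1 and 2 (our own toy code, not from the paper), the free energy of a model with a handful of binary latent variables can be computed by explicit marginalization, F_θ(v) = −log Σ_h exp(−E_θ(v, h)); for the Gaussian-binary RBM of Example 1 below, this brute-force sum matches the analytic softplus form. All names and dimensions in the sketch are illustrative assumptions.

```python
import numpy as np
from itertools import product

def energy(v, h, W, b, c):
    """Gaussian-binary RBM energy with sigma = 1 (see Eq. 6 below)."""
    return -v @ W @ h - b @ h + 0.5 * np.sum((c - v) ** 2)

def free_energy_bruteforce(v, W, b, c):
    """F(v) = -log sum_h exp(-E(v, h)), marginalizing h explicitly."""
    nh = W.shape[1]
    configs = [np.array(h, dtype=float) for h in product([0, 1], repeat=nh)]
    return -np.log(sum(np.exp(-energy(v, h, W, b, c)) for h in configs))

def free_energy_analytic(v, W, b, c):
    """Closed form: 0.5 ||c - v||^2 - sum_j softplus(v^T W_j + b_j)."""
    return 0.5 * np.sum((c - v) ** 2) - np.sum(np.logaddexp(0.0, v @ W + b))

rng = np.random.RandomState(0)
nv, nh = 5, 3
W, b, c, v = rng.randn(nv, nh), rng.randn(nh), rng.randn(nv), rng.randn(nv)
print(np.allclose(free_energy_bruteforce(v, W, b, c),
                  free_energy_analytic(v, W, b, c)))   # True
```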

The benefit of optimizing J(θ) is that Z(θ) cancels off in the derivative of log p_θ(v), since it is constant with respect to each v_i. However, in the above form, J(θ) is still intractable due to the dependence on p̃(v). Hyvärinen (2005) shows that under weak regularity conditions J(θ) can be expressed in the following form, which can be tractably approximated by replacing the expectation over the empirical distribution by an empirical average over the training set:

    J(\theta) = E_{\tilde{p}(v)}\left[\sum_{i=1}^{n_v} \frac{1}{2}\psi_i(p_\theta(v))^2 + \frac{\partial \psi_i(p_\theta(v))}{\partial v_i}\right].    (4)

In theoretical situations where the regularity conditions on the derivatives of the empirical distribution are not satisfied, or in practical situations where a finite sample approximation to the expectation over the empirical distribution is used, a smoothed version of the score matching estimator may be of interest. Consider smoothing p̃(v) using a probabilistic kernel q_β(v|v′) with bandwidth parameter β > 0. We obtain a new distribution q_β(v) = ∫ q_β(v|v′) p̃(v′) dv′. Vincent (2011) showed that applying score matching to q_β(v) is equivalent to the following objective function, where q_β(v, v′) = q_β(v|v′) p̃(v′):

    Q(\theta) = E_{q_\beta(v,v')}\left[\sum_{i=1}^{n_v} \frac{1}{2}\big(\psi_i(q_\beta(v|v')) - \psi_i(p_\theta(v))\big)^2\right].    (5)

For the case where q_β(v|v′) = N(v|v′, β²), i.e. a Gaussian smoothing kernel with variance β², this is equivalent to the regularized score matching objective proposed in (Kingma & LeCun, 2010). We refer to the objective given by Equation 5 as denoising score matching (SMD). Although SMD is intractable to evaluate analytically, we can again replace the integral over v′ by an empirical average over a finite sample of training data. We can then replace the integral over v by an empirical average over samples v, which can be easily drawn from q_β(v|v′) for each training sample v′.

Compared to PCD and CD, SM and SMD give tractable objective functions that can be used to monitor training progress. While SMD is not consistent, it does have significant computational advantages relative to SM (Vincent, 2011).

Example 1 (Score Matching for Gaussian-binary RBMs). Here, the energy E_θ(v, h) is given by:

    E_\theta(v, h) = -\sum_{i=1}^{n_v}\sum_{j=1}^{n_h} \frac{v_i}{\sigma_i} W_{ij} h_j - \sum_{j=1}^{n_h} b_j h_j + \frac{1}{2}\sum_{i=1}^{n_v} \frac{(c_i - v_i)^2}{\sigma_i^2},    (6)

where the parameters are θ = (W, σ, b, c) and h_j ∈ {0, 1}. This leads to the free energy F_θ(v):

    F_\theta(v) = \frac{1}{2}\sum_{i=1}^{n_v} \frac{(c_i - v_i)^2}{\sigma_i^2} - \sum_{j=1}^{n_h} \log\left(1 + \exp\left(\sum_{i=1}^{n_v} \frac{v_i}{\sigma_i} W_{ij} + b_j\right)\right).    (7)

The corresponding score matching objective is:

    J(\theta) = \frac{1}{N}\sum_{n=1}^{N}\sum_{i=1}^{n_v}\left[\frac{1}{2}\left(\frac{v_{in} - c_i}{\sigma_i^2} - \frac{1}{\sigma_i}\sum_{j=1}^{n_h} W_{ij}\hat{h}_{jn}\right)^2 - \frac{1}{\sigma_i^2} + \sum_{j=1}^{n_h} \frac{W_{ij}^2}{\sigma_i^2}\hat{h}_{jn}\big(1 - \hat{h}_{jn}\big)\right],    (8)

where \hat{h}_{jn} := \mathrm{sigm}\left(\sum_{i=1}^{n_v} \frac{v_{in}}{\sigma_i} W_{ij} + b_j\right) and \mathrm{sigm}(x) := \frac{1}{1 + \exp(-x)}.

For a standardized Normal model, with c = 0 and σ = 1, this objective reduces to:

    J(\theta) = \frac{1}{N}\sum_{n=1}^{N}\sum_{i=1}^{n_v}\left[\frac{1}{2}\left(v_{in} - \sum_{j=1}^{n_h} W_{ij}\hat{h}_{jn}\right)^2 - 1 + \sum_{j=1}^{n_h} W_{ij}^2 \hat{h}_{jn}\big(1 - \hat{h}_{jn}\big)\right].    (9)

The first term corresponds to the quadratic reconstruction error of an autoencoder with tied weights. From this we can see that this type of autoencoder, which researchers have previously treated as a different model, can in fact be explained by the application of the score matching estimation principle to Gaussian RBMs.
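To make Equations 9 and 5 concrete, the following is a minimal NumPy sketch of both objectives for the standardized Gaussian-binary RBM (c = 0, σ = 1). The function and variable names, toy dimensions, and noise level are our own illustrative choices rather than the paper's implementation (which trains the mPoT model with Theano); in practice these objectives would be differentiated with respect to W and b and minimized by stochastic gradient descent.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def model_score(V, W, b):
    """psi(p_theta(v)) = -dF/dv for the standardized Gaussian-binary RBM:
    the tied-weight reconstruction W sigm(W^T v + b) minus v."""
    return sigm(V @ W + b) @ W.T - V

def sm_objective(V, W, b):
    """Score matching objective of Eq. 9: quadratic reconstruction error
    plus the curvature regularizer, averaged over the N rows of V."""
    H = sigm(V @ W + b)                                    # hidden activations h_hat
    recon_err = 0.5 * np.sum((V - H @ W.T) ** 2, axis=1)   # 1/2 (v - W h_hat)^2
    curvature = np.sum((H * (1 - H)) @ (W ** 2).T - 1.0, axis=1)
    return np.mean(recon_err + curvature)

def smd_objective(Vclean, W, b, beta=0.1, rng=None):
    """One-sample Monte Carlo estimate of the denoising objective of Eq. 5
    with the Gaussian kernel q_beta(v | v') = N(v | v', beta^2 I)."""
    rng = rng or np.random.RandomState(0)
    Vnoisy = Vclean + beta * rng.randn(*Vclean.shape)      # v ~ q_beta(v | v')
    kernel_score = -(Vnoisy - Vclean) / beta ** 2          # psi_i(q_beta(v | v'))
    diff = kernel_score - model_score(Vnoisy, W, b)
    return 0.5 * np.mean(np.sum(diff ** 2, axis=1))

# Toy usage on random "data" (illustrative only).
rng = np.random.RandomState(0)
V = rng.randn(100, 16)
W, b = 0.01 * rng.randn(16, 8), np.zeros(8)
print(sm_objective(V, W, b), smd_objective(V, W, b))
```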

3. Applying and Interpreting Score Matching For Latent EBMs

We now derive score matching objectives for several commonly used EBMs. In order to apply score matching to a particular EBM, one simply needs an expression for the corresponding free energy.

Example 2 (Score matching for mcRBM). The energy E_θ(v, h^m, h^c) of the mcRBM model for each data point includes mean Bernoulli hidden units h^m_j ∈ {0, 1} and covariance Bernoulli hidden units h^c_k ∈ {0, 1}. The latter allow one to model correlations in the data v (Ranzato & Hinton, 2010; Ranzato et al., 2010a). To ease the notation, we will ignore the index n over the data. The energy for this model is:

    E_\theta(v, h^m, h^c) = -\frac{1}{2}\sum_{f=1}^{n_f}\sum_{k=1}^{n_{hc}} P_{fk}\, h^c_k \Big(\sum_{i=1}^{n_v} C_{if} v_i\Big)^2 - \sum_{j=1}^{n_{hm}}\sum_{i=1}^{n_v} W_{ij} h^m_j v_i - \sum_{j=1}^{n_{hm}} b^m_j h^m_j - \sum_{k=1}^{n_{hc}} b^c_k h^c_k - \sum_{i=1}^{n_v} b^v_i v_i + \frac{1}{2}\sum_{i=1}^{n_v} v_i^2,    (10)

where θ = (b^v, b^m, b^c, P, W, C). This leads to the free energy F_θ(v):

    F_\theta(v) = -\sum_{k=1}^{n_{hc}} \log\big(1 + e^{\phi^c_k}\big) - \sum_{j=1}^{n_{hm}} \log\big(1 + e^{\phi^m_j}\big) - \sum_{i=1}^{n_v} b^v_i v_i + \frac{1}{2}\sum_{i=1}^{n_v} v_i^2,    (11)

where \phi^c_k = \frac{1}{2}\sum_{f=1}^{n_f} P_{fk}\big(\sum_{i=1}^{n_v} C_{if} v_i\big)^2 + b^c_k and \phi^m_j = \sum_{i=1}^{n_v} W_{ij} v_i + b^m_j.

The corresponding score matching objective is:

    J(\theta) = \sum_{i=1}^{n_v}\Bigg[\frac{1}{2}\psi_i(p_\theta(v))^2 + \sum_{k=1}^{n_{hc}}\Big(\rho(\hat{h}^c_k)\, D_{ik}^2 + \hat{h}^c_k K_{ik}\Big) + \sum_{j=1}^{n_{hm}} \hat{h}^m_j\big(1 - \hat{h}^m_j\big) W_{ij}^2 - 1\Bigg],    (12)

where

    \psi_i(p_\theta(v)) = \sum_{k=1}^{n_{hc}} \hat{h}^c_k D_{ik} + \sum_{j=1}^{n_{hm}} \hat{h}^m_j W_{ij} + b^v_i - v_i,
    K_{ik} = \sum_{f=1}^{n_f} P_{fk} C_{if}^2,
    D_{ik} = \sum_{f=1}^{n_f} P_{fk}\Big(\sum_{i'=1}^{n_v} C_{i'f} v_{i'}\Big) C_{if},
    \hat{h}^c_k = \mathrm{sigm}(\phi^c_k),
    \hat{h}^m_j = \mathrm{sigm}(\phi^m_j),
    \rho(x) := x(1 - x).

Example 3 (Score matching for mPoT). The energy E_θ(v, h^m, h^c) of the mPoT model is:

    E_\theta(v, h^m, h^c) = \sum_{k=1}^{n_{hc}}\Bigg[h^c_k\Big(1 + \frac{1}{2}\big(\sum_{i=1}^{n_v} C_{ik} v_i\big)^2\Big) + (1 - \gamma)\log(h^c_k)\Bigg] + \frac{1}{2}\sum_{i=1}^{n_v} v_i^2 - \sum_{i=1}^{n_v} b^v_i v_i - \sum_{j=1}^{n_{hm}}\sum_{i=1}^{n_v} h^m_j W_{ij} v_i - \sum_{j=1}^{n_{hm}} b^m_j h^m_j,    (13)

where θ = (γ, W, C, b^v, b^m), h^c is a vector of Gamma covariance latent variables, C is a filter bank and γ is a scalar parameter. This leads to the free energy F_θ(v):

    F_\theta(v) = \sum_{k=1}^{n_{hc}} \gamma\log\Big(1 + \frac{1}{2}(\phi^c_k)^2\Big) - \sum_{j=1}^{n_{hm}} \log\big(1 + e^{\phi^m_j}\big) - \sum_{i=1}^{n_v} b^v_i v_i + \frac{1}{2}\sum_{i=1}^{n_v} v_i^2,    (14)

where \phi^c_k = \sum_{i=1}^{n_v} C_{ik} v_i and \phi^m_j = \sum_{i=1}^{n_v} W_{ij} v_i + b^m_j.
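The mPoT expressions above are easy to turn into code. The sketch below is our own illustrative NumPy code with arbitrary toy dimensions, not the released mPoT implementation: it evaluates the free energy of Equation 14 and the corresponding model score ψ(p_θ(v)) = −∂F_θ(v)/∂v, built from the conditional means ĥ^c_k = γ/(1 + ½(φ^c_k)²) and ĥ^m_j = sigm(φ^m_j) that reappear in the redefinitions that follow. A finite-difference check confirms the gradient.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def mpot_free_energy(v, C, W, bm, bv, gamma):
    """mPoT free energy, Eq. 14, with phi_c = C^T v and phi_m = W^T v + b_m."""
    phi_c = C.T @ v
    phi_m = W.T @ v + bm
    return (gamma * np.sum(np.log1p(0.5 * phi_c ** 2))
            - np.sum(np.log1p(np.exp(phi_m)))
            - bv @ v + 0.5 * v @ v)

def mpot_score(v, C, W, bm, bv, gamma):
    """Model score psi(p_theta(v)) = -dF/dv, expressed via the conditional
    means h_c = gamma / (1 + 0.5 phi_c^2) and h_m = sigm(phi_m)."""
    phi_c = C.T @ v
    phi_m = W.T @ v + bm
    h_c = gamma / (1.0 + 0.5 * phi_c ** 2)
    h_m = sigm(phi_m)
    return -(C @ (h_c * phi_c)) + W @ h_m + bv - v

# Finite-difference check that psi_i = -dF/dv_i (illustrative sizes only).
rng = np.random.RandomState(0)
nv, nhc, nhm = 6, 5, 4
C, W = rng.randn(nv, nhc), rng.randn(nv, nhm)
bm, bv, gamma, v = rng.randn(nhm), rng.randn(nv), 1.5, rng.randn(nv)
eps = 1e-5
num_grad = np.array([(mpot_free_energy(v + eps * e, C, W, bm, bv, gamma)
                      - mpot_free_energy(v - eps * e, C, W, bm, bv, gamma)) / (2 * eps)
                     for e in np.eye(nv)])
print(np.allclose(mpot_score(v, C, W, bm, bv, gamma), -num_grad))  # True
```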

The corresponding score matching objective J(θ) is equivalent to the objective given in Equation 12 with the following redefinition of terms:

    P = -I_{n_{hc}},
    \hat{h}^c_k = \gamma\,\varphi(\phi^c_k),    (15)
    \hat{h}^m_j = \mathrm{sigm}(\phi^m_j),    (16)
    \varphi(x) := \frac{1}{1 + \frac{1}{2}x^2},
    \rho(x) := x^2,

where I_{n_{hc}} is the n_{hc} × n_{hc} identity matrix.

In each of these examples, we see that an objective emerges which seeks to minimize a form of regularized reconstruction error, and that the forms of these regularizers can end up being quite different. Rather than trying to interpret score matching on a case by case basis, we provide a general theorem for all latent EBMs to which score matching can be applied:

Theorem 1. The score matching objective, Equation (4), for a latent energy based model can be expressed succinctly in terms of either the free energy or expectations of the energy with respect to the conditional distribution p(h|v). Specifically,

    J(\theta) = E_{\tilde{p}(v)}\Bigg[\sum_{i=1}^{n_v} \frac{1}{2}\psi_i(p_\theta(v))^2 + \frac{\partial\psi_i(p_\theta(v))}{\partial v_i}\Bigg]
              = E_{\tilde{p}(v)}\Bigg[\sum_{i=1}^{n_v} \frac{1}{2}\Big(E_{p_\theta(h|v)}\Big[\frac{\partial E_\theta(v,h)}{\partial v_i}\Big]\Big)^2 + \mathrm{var}_{p_\theta(h|v)}\Big(\frac{\partial E_\theta(v,h)}{\partial v_i}\Big) - E_{p_\theta(h|v)}\Big[\frac{\partial^2 E_\theta(v,h)}{\partial v_i^2}\Big]\Bigg].
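Theorem 1 can be checked numerically on a small model. The sketch below is our own illustrative code: it uses the standardized Gaussian-binary RBM of Example 1, for which p_θ(h|v) can be enumerated exactly, and compares the free-energy form ½ψ_i² + ∂ψ_i/∂v_i against the energy-expectation form of the theorem, term by term.

```python
import numpy as np
from itertools import product

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.RandomState(0)
nv, nh = 4, 3
W, b, v = rng.randn(nv, nh), rng.randn(nh), rng.randn(nv)

# Free-energy form: psi_i = -dF/dv_i and dpsi_i/dv_i, computed analytically
# for the standardized Gaussian-binary RBM (c = 0, sigma = 1).
h_mean = sigm(v @ W + b)
psi = W @ h_mean - v
dpsi_dv = (W ** 2) @ (h_mean * (1 - h_mean)) - 1.0
lhs = 0.5 * psi ** 2 + dpsi_dv

# Energy-expectation form: enumerate h in {0,1}^nh to get p(h|v) exactly,
# then compute E[dE/dv_i], var(dE/dv_i), and E[d^2E/dv_i^2].
configs = np.array(list(product([0, 1], repeat=nh)), dtype=float)
energies = -(v @ W) @ configs.T - configs @ b + 0.5 * np.sum(v ** 2)
p_h_given_v = np.exp(-energies)
p_h_given_v /= p_h_given_v.sum()
dE_dv = v[:, None] - W @ configs.T            # shape (nv, 2^nh): dE/dv_i for each h
mean_dE = dE_dv @ p_h_given_v
var_dE = (dE_dv - mean_dE[:, None]) ** 2 @ p_h_given_v
d2E_dv2 = np.ones(nv)                          # second derivative is 1 for every h
rhs = 0.5 * mean_dE ** 2 + var_dE - d2E_dv2

print(np.allclose(lhs, rhs))                   # True, as Theorem 1 predicts
```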

Corollary 1. If the energy function of a latent EBM E_θ(v, h) takes the following form:

    E_\theta(v, h) = \frac{1}{2}(v - \mu(h))^T \Omega(h)\,(v - \mu(h)) + g(h),

where µ(h) is an arbitrary vector-valued function of length n_v, g(h) is an arbitrary scalar function, and Ω(h) is an n_v × n_v positive-definite matrix-valued function, then the vector-valued score function ψ(p_θ(v)) will be −E_{p_θ(h|v)}[Ω(h)(v − µ(h))]. As a result, the score matching objective can be expressed as:

    J(\theta) = E_{\tilde{p}(v)}\Bigg[\sum_{i=1}^{n_v} \frac{1}{2}\Big(E_{p_\theta(h|v)}\big[\Omega(h)(v - \mu(h))\big]_i\Big)^2 + \mathrm{var}_{p_\theta(h|v)}\Big(\big[\Omega(h)(v - \mu(h))\big]_i\Big) - E_{p_\theta(h|v)}\big[\Omega(h)_{ii}\big]\Bigg].

The proofs of Theorem 1 and Corollary 1 are straightforward, and can be found in an online appendix to this paper (http://www.cs.ubc.ca/~nando/papers/smpaper-appendix.pdf). Corollary 1 states that score matching applied to a Gaussian latent EBM will always result in a quadratic reconstruction term with penalties to minimize the variance of the reconstruction and to maximize the expected curvature of the energy with respect to v. This shows that we can develop new autoencoder architectures in a principled way by simply starting with an EBM and applying score matching.

One further connection between the two models is that one step of gradient descent on the free energy F_θ(v) of an EBM corresponds to one feed-forward step of an autoencoder. To see this, consider the mPoT model. If we start at some visible configuration v and update a single dimension i:

    v_i^{(t+1)} = v_i^{(t)} - \eta\frac{\partial F_\theta(v)}{\partial v_i}
                = v_i^{(t)} + \eta\Bigg(\sum_{k=1}^{n_{hc}} \hat{h}^c_k D_{ik} + \sum_{j=1}^{n_{hm}} \hat{h}^m_j W_{ij} + b^v_i - v_i^{(t)}\Bigg).

Then setting η = 1, the v_i^{(t)} terms cancel and we get:

    v_i^{(t+1)} = \sum_{k=1}^{n_{hc}} \hat{h}^c_k D_{ik} + \sum_{j=1}^{n_{hm}} \hat{h}^m_j W_{ij} + b^v_i.    (17)

This corresponds to the reconstruction produced by mPoT in its score matching objective. In general, an autoencoder reconstruction can be produced by taking a single step of gradient descent along the free energy of its corresponding EBM.
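This single-gradient-step identity is also easy to verify in code. The sketch below is an illustration using the standardized Gaussian-binary RBM free energy rather than mPoT (whose gradient has the same structure but more terms); with η = 1 the gradient step reproduces the tied-weight autoencoder reconstruction exactly. Names and sizes are our own illustrative choices.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def free_energy_grad(v, W, b):
    """dF/dv for the standardized Gaussian-binary RBM: v - W sigm(W^T v + b)."""
    return v - W @ sigm(v @ W + b)

def autoencoder_reconstruction(v, W, b):
    """Tied-weight autoencoder forward pass: decode(encode(v))."""
    return W @ sigm(v @ W + b)

rng = np.random.RandomState(0)
W, b, v = rng.randn(8, 4), rng.randn(4), rng.randn(8)

one_gradient_step = v - 1.0 * free_energy_grad(v, W, b)   # eta = 1
print(np.allclose(one_gradient_step,
                  autoencoder_reconstruction(v, W, b)))     # True (Eq. 17)
```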

4. Experiments

In this section, we study several estimation methods applied to the mPoT model, including SM, SMD, CD, PCD, and FPCD, with the goal of uncovering differences in the characteristics of trained models due to variations in training methods. For our experiments, we used two datasets of images.

The first dataset consists of 128,000 color image patches of size 16x16 pixels randomly extracted from the Berkeley segmentation dataset (http://www.cs.berkeley.edu/projects/vision/grouping/segbench/). We subtracted the per-patch means and applied PCA whitening. We retained 99% of the variance, corresponding to 105 eigenvectors. All estimation methods were applied to the mPoT model by training on mini-batches of size 128 for 100 epochs of stochastic gradient descent.

The second dataset, named CIFAR 10 (Krizhevsky, 2009), consists of color images of size 32x32 pixels belonging to one of 10 categories. The task is to classify a set of 10,000 test images. CIFAR 10 is a subset of a larger dataset of tiny images (Torralba et al., 2008). Using a protocol established in previous work (Krizhevsky, 2009; Ranzato & Hinton, 2010), we built a training dataset of 8x8 color image patches from this larger dataset, ensuring there was no overlap with CIFAR 10. The preprocessing of the data is exactly the same as for the Berkeley dataset, but here we use approximately 800,000 image patches and perform only 10 epochs of training. For our experiments, we used the Theano package (http://deeplearning.net/software/theano/) and mPoT code (http://www.cs.toronto.edu/~ranzato/publications/mPoT/mPoT.html) from (Ranzato et al., 2010b).

4.1. Objective Function Analysis

From Corollary 1, we know that we can interpret score matching for mPoT as trading off reconstruction error, reconstruction variance and the expected curvature of the energy function with respect to the visible units. This experiment, using the Berkeley dataset, is designed to determine how these terms evolve over the course of training and to what degree their changes impact the final model. Figures 1(a) and 1(b) show the values of the three terms using non-noisy inputs on each training epoch, as well as the overall objective function (the sum of the three terms). Surprisingly, these results show that most of the training is involved with maximizing the expected curvature (corresponding to a lower negative curvature). In SM, each point is relatively isolated in v-space, meaning that the objective will try to make the distribution very peaked. In SMD, each point exists near a cloud of points and so the distribution must be broader. From this perspective, SMD can be seen as a regularized version of SM that puts less emphasis on changing the expected curvature. This also seems to give SMD some room to reduce the reconstruction error.

[Figure 1 appears here: six panels plotted against training epoch. (a) SM terms (×10^4), (b) SMD terms, and (c) Autoencoder terms, each showing the Total, Recon, Var, and Curve values; (d) free energy difference and (e), (f) denoising MSE curves, each comparing FPCD, PCD, CD, SM, and SMD.]

Figure 1. (a), (b), (c) Expected reconstruction error, reconstruction variance, and energy curvature for SM, SMD, and AE; Total represents the sum of these terms. (d) Difference of free energy between noisy and test images. (e) MSE of denoised test images using mean-field. (f) MSE of denoised test images using Bayesian MAP.

To examine the impact of regularization, we trained an autoencoder (AE) based on the mPoT model using the reconstruction given by Equation 17, which corresponds to SM without the variance and curvature terms. Figure 1(c) shows that simply optimizing the reconstruction leaves the curvature almost invariant, which agrees with the findings of (Ranzato et al., 2007).

4.2. Denoising

In our next set of experiments, we compare models learned by each of the score matching estimators with models learned by the more commonly used stochastic estimators. For these experiments, we trained mPoT models corresponding to SM, SMD, FPCD, PCD, and CD. We compare the models in terms of the average free energy difference between natural image patches and patches corrupted by Gaussian noise. We also consider denoising natural image patches. (For convenience, both tasks were performed in the PCA domain. We use a standard deviation of 1 for the Gaussian noise in all cases.)

During training, we hope that the probability of natural images will increase while that of other images decreases. The free energy difference between natural and other images is equivalent to the log of their probability ratio, so we expect the free energy difference to increase during training as well. Figure 1(d) shows the difference in free energy between a test set of 10,000 image patches from the Berkeley dataset and the same images corrupted by noise. For most estimators, the free energy difference improves as training proceeds, as expected. Interestingly, SM and SMD exhibit completely opposite behaviors. SM seems to significantly increase the free energy difference relative to nearby noisy images, corresponding to a distribution that is peaked around natural images. SMD, on the other hand, actually decreases the free energy difference relative to nearby noisy images.
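The free-energy-difference diagnostic of Figure 1(d) only requires a free energy function. Below is a minimal sketch (our own code; `free_energy` stands for any trained model's free energy, for example the mPoT free energy of Equation 14, and the noise standard deviation of 1 follows the setting described above).

```python
import numpy as np

def free_energy_gap(V_clean, free_energy, noise_std=1.0, rng=None):
    """Average free energy difference F(noisy) - F(clean) over a batch of
    patches, the quantity plotted in Figure 1(d). `free_energy` is any
    callable mapping a single patch to a scalar."""
    rng = rng or np.random.RandomState(0)
    V_noisy = V_clean + noise_std * rng.randn(*V_clean.shape)
    return np.mean([free_energy(vn) - free_energy(vc)
                    for vn, vc in zip(V_noisy, V_clean)])
```

A larger positive value means the model assigns relatively lower probability to the corrupted patches than to the clean ones.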

[Figure 2 appears here: two panels of learned filters, (a) mean filters and (b) covariance filters, for each estimation method.]

Figure 2. mPoT filters learned using different estimation methods: (a) “mean” filters, (b) “covariance” filters.

In the next experiment, we consider an image denoising task. We take an image patch v and add Gaussian white noise, obtaining a noisy patch v′. We then apply each model to denoise each patch v′, obtaining a reconstruction v̂. The first denoising method, shown in Figure 1(e), computes a reconstruction v̂ by simulating one step of a Markov chain using a mean-field approximation. That is, we first compute h^c_k and h^m_j by Equations 15 and 16 using v′ as the input. The reconstruction is the expectation of the conditional distribution P_θ(v|h^c_k, h^m_j). The second method, shown in Figure 1(f), is the Bayesian MAP estimator:

    \hat{v} = \arg\min_v F_\theta(v) + \frac{\lambda}{2}\|v - v'\|^2,    (18)

where λ is a scalar representing how close the reconstruction should remain to the noisy input. We select λ by cross-validation. The results show that score matching achieves the minimum error using both denoising approaches; however, it quickly overfits as training proceeds. FPCD and PCD do not match the minimum error of SM and also overfit, albeit to a lesser extent. CD and SMD do not appear to overfit. However, we note that the minimum error obtained by SMD is significantly higher than the minimum error obtained by SM using both denoising methods. This is quite intuitive, since SMD is equivalent to estimating the model using a smoothed training distribution that shifts mass onto nearby noisy images.
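As a concrete illustration of the MAP denoiser in Equation 18, the sketch below minimizes F_θ(v) + (λ/2)‖v − v′‖² by plain gradient descent. The Gaussian-binary RBM free energy gradient, step size, iteration count and λ are our own stand-ins; the experiments above use the trained mPoT free energy and select λ by cross-validation.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def denoise_map(v_noisy, free_energy_grad, lam=1.0, n_steps=500, lr=0.01):
    """Gradient-descent solver for the MAP denoiser of Eq. 18:
    v_hat = argmin_v F_theta(v) + (lam / 2) * ||v - v_noisy||^2."""
    v = v_noisy.copy()
    for _ in range(n_steps):
        v -= lr * (free_energy_grad(v) + lam * (v - v_noisy))
    return v

# Illustrative usage with the standardized Gaussian-binary RBM free energy,
# whose gradient is dF/dv = v - W sigm(W^T v + b).
rng = np.random.RandomState(0)
W, b = 0.1 * rng.randn(16, 8), np.zeros(8)
grad_F = lambda v: v - W @ sigm(v @ W + b)
v_noisy = rng.randn(16)
v_hat = denoise_map(v_noisy, grad_F, lam=2.0)
```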
4.3. Feature Extraction and Classification

One of the primary uses for latent EBMs is to generate discriminative features. Table 1 shows the result of using each method to extract features on the benchmark CIFAR 10 dataset. We follow the protocol of (Ranzato & Hinton, 2010) with early stopping. We use a validation set to select regularization parameters. With the exception of AE, all methods appear to do well and the differences between them are not statistically significant. AE, on the other hand, does significantly worse.

Table 1. Recognition accuracy on CIFAR 10.

    CD       PCD      FPCD     SM       SMD      AE
    64.6%    64.7%    65.5%    65.0%    64.7%    57.6%

Finally, we show examples of filters learned by each method. Figure 2(a) shows a random subset of “mean” filters corresponding to the columns of W, while Figure 2(b) shows a random subset of “covariance” filters corresponding to the columns of C. Interestingly, only FPCD and PCD show structure in the learned mean filters. In the covariance units, all methods except AE learn localized Gabor-like filters. It is well known that obtaining nice looking filters usually correlates with good performance, but it is not always clear what leads to these filters.

We have shown here that one way to obtain good qualitative and quantitative performance is to focus on appropriately modeling the curvature of the energy with respect to v. In this context, the SM reconstruction and variance terms serve to ensure that the peaks of the distribution occur around the training cases.

5. Conclusion

By applying score matching to the energy space of a latent EBM, as opposed to the free energy space, we gain an intuitive interpretation of the score matching objective. We can always break the objective down into three terms corresponding to expectations under the conditional distribution of the hidden units: reconstruction, reconstruction variance, and curvature. We have determined that for the Gaussian-binary RBM, the reconstruction term will always correspond to an autoencoder with tied weights. While autoencoders and RBMs were previously considered to be related but separate models, this analysis shows that they can be interpreted as different estimators applied to the same underlying model. We also showed that one can derive novel autoencoders by applying score matching to more complex EBMs. This allows us to think about models in terms of EBMs before creating a corresponding autoencoder to leverage fast inference. Furthermore, this framework provides guidance on selecting principled regularization functions for autoencoder training, leading to improved representations.

Our experiments show not only that score matching yields similar performance to existing estimation methods when applied to classification, but also that shaping the curvature of the energy appropriately may be important for generating good features. While this seems obvious for probabilistic EBMs, it has previously been difficult to apply to autoencoders because they were not thought of as having a corresponding energy function. Now that we know which statistics may be important to monitor during training, it would be interesting to see what happens when other heuristics, such as sparsity, are applied to help generate interpretable features.

References

Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.

Hinton, G.E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.

Hinton, G.E. and Zemel, R.S. Autoencoders, minimum description length and Helmholtz free energy. In Advances in Neural Information Processing Systems, pp. 3–10, 1994.

Hinton, G.E., Osindero, S., and Teh, Y.W. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

Hyvärinen, A. Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research, 6:695–709, 2005.

Kingma, D. and LeCun, Y. Regularized estimation of image statistics by score matching. In Advances in Neural Information Processing Systems, 2010.

Krizhevsky, A. Learning multiple layers of features from tiny images. MSc Thesis, Dept. of Comp. Science, Univ. of Toronto, 2009.

LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., and Huang, F.J. A tutorial on energy-based learning. In Predicting Structured Data. MIT Press, 2006.

Marlin, B.M., Swersky, K., Chen, B., and de Freitas, N. Inductive principles for restricted Boltzmann machine learning. In Artificial Intelligence and Statistics, pp. 509–516, 2010.

Memisevic, R. and Hinton, G.E. Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 22:1473–1492, 2009.

Ranzato, M. and Hinton, G.E. Modeling pixel means and covariances using factorized third-order Boltzmann machines. In IEEE Computer Vision and Pattern Recognition, pp. 2551–2558, 2010.

Ranzato, M., Boureau, Y.L., Chopra, S., and LeCun, Y. A unified energy-based framework for unsupervised learning. In Artificial Intelligence and Statistics, 2007.

Ranzato, M., Krizhevsky, A., and Hinton, G.E. Factored 3-way restricted Boltzmann machines for modeling natural images. In Artificial Intelligence and Statistics, pp. 621–628, 2010a.

Ranzato, M., Mnih, V., and Hinton, G.E. How to generate realistic images using gated MRF's. In Advances in Neural Information Processing Systems, pp. 2002–2010, 2010b.

Schmidt, U., Gao, Q., and Roth, S. A generative perspective on MRFs in low-level vision. In IEEE Computer Vision and Pattern Recognition, 2010.

Swersky, K., Chen, B., Marlin, B.M., and de Freitas, N. A tutorial on stochastic approximation algorithms for training restricted Boltzmann machines and deep belief nets. In Information Theory and Applications Workshop, pp. 1–10, 2010.

Teh, Y.W., Welling, M., Osindero, S., and Hinton, G.E. Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4:1235–1260, 2003.

Tieleman, T. Training restricted Boltzmann machines using approximations to the likelihood gradient. In International Conference on Machine Learning, pp. 1064–1071, 2008.

Tieleman, T. and Hinton, G.E. Using fast weights to improve persistent contrastive divergence. In International Conference on Machine Learning, 2009.

Torralba, A., Fergus, R., and Freeman, W.T. 80 million tiny images: A large dataset for non-parametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30:1958–1970, 2008.

Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, to appear, 2011.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.A. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning, pp. 1096–1103, 2008.

Welling, M., Hinton, G.E., and Osindero, S. Learning sparse topographic representations with products of Student-t distributions. In Advances in Neural Information Processing Systems, 2003.

Welling, M., Rosen-Zvi, M., and Hinton, G.E. Exponential family harmoniums with an application to information retrieval. In Advances in Neural Information Processing Systems, 2005.

Younes, L. Parametric inference for imperfectly observed Gibbsian fields. Probability Theory and Related Fields, 82(4):625–645, 1989.