
Variational (Gradient) Estimate of the Score Function in Energy-based Latent Variable Models

Fan Bao 1 2   Kun Xu 1   Chongxuan Li 1   Lanqing Hong 2   Jun Zhu 1   Bo Zhang 1

1 Dept. of Comp. Sci. & Tech., Institute for AI, THBI Lab, BNRist Center, State Key Lab for Intell. Tech. & Sys., Tsinghua University, Beijing, China   2 Huawei Noah's Ark Lab. Correspondence to: Jun Zhu.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

This paper presents new estimates of the score function and its gradient with respect to the model parameters in a general energy-based latent variable model (EBLVM). The score function and its gradient can be expressed as combinations of expectation and covariance terms over the (generally intractable) posterior of the latent variables. New estimates are obtained by introducing a variational posterior to approximate the true posterior in these terms. The variational posterior is trained to minimize a certain divergence (e.g., the KL divergence) between itself and the true posterior. Theoretically, the divergence characterizes upper bounds of the bias of the estimates. In principle, our estimates can be applied to a wide range of objectives, including kernelized Stein discrepancy (KSD), score matching (SM)-based methods and the exact Fisher divergence, with a minimal model assumption. In particular, these estimates applied to SM-based methods outperform existing methods in learning EBLVMs on several image datasets.

1. Introduction

Energy-based models (EBMs) (LeCun et al., 2006) associate an energy with each configuration of the visible variables, which naturally induces a probability distribution by normalizing the exponentiated negative energy. Such models have found applications in a wide range of areas, such as image synthesis (Du & Mordatch, 2019; Nijkamp et al., 2020), out-of-distribution detection (Grathwohl et al., 2020a; Liu et al., 2020) and controllable generation (Nijkamp et al., 2019). Based on EBMs, energy-based latent variable models (EBLVMs) (Swersky et al., 2011; Vértes et al., 2016) incorporate latent variables to further improve the expressive power (Salakhutdinov & Hinton, 2009; Bao et al., 2020) and to enable representation learning (Welling et al., 2004; Salakhutdinov & Hinton, 2009; Srivastava & Salakhutdinov, 2012) and conditional sampling (see results in Sec. 4.2).

Notably, the score function (SF) of an EBM is defined as the gradient of its log-density with respect to the visible variables, which is independent of the partition function and thereby tractable. Due to this tractability, SF-based methods are appealing for both learning (Hyvärinen, 2005; Liu et al., 2016) and evaluating (Grathwohl et al., 2020b) EBMs, compared with many other approaches (Hinton, 2002; Tieleman, 2008; Gutmann & Hyvärinen, 2010; Salakhutdinov, 2008) (see a comprehensive discussion in Sec. 6). In particular, score matching (SM) (Hyvärinen, 2005), which minimizes the expected squared distance between the score function of the data distribution and that of the model distribution, and its variants (Kingma & LeCun, 2010; Vincent, 2011; Saremi et al., 2018; Song et al., 2019; Li et al., 2019; Pang et al., 2020) have shown promise in learning EBMs.

Unfortunately, the score function and its gradient with respect to the model parameters in EBLVMs are generally intractable without a strong structural assumption (e.g., that the joint distribution of the visible and latent variables is in the exponential family (Vértes et al., 2016)), and thereby SF-based methods are not directly applicable to general EBLVMs. Recently, bi-level score matching (BiSM) (Bao et al., 2020) applied SM to general EBLVMs by reformulating an SM-based objective as a bilevel optimization problem; it learns a deep EBLVM on natural images and outperforms an EBM of the same size. However, BiSM is not without problems: in practice it is optimized by gradient unrolling (Metz et al., 2017) of the lower-level optimization, which is time and memory consuming (see a comparison in Sec. 4.1).

In this paper, we present variational estimates of the score function and its gradient with respect to the model parameters in general EBLVMs, referred to as VaES and VaGES respectively. The score function and its gradient can be expressed as combinations of expectation and covariance terms over the posterior of the latent variables. VaES and VaGES are obtained by introducing a variational posterior, which approximates the true posterior in these terms by minimizing a certain divergence between itself and the true posterior. Theoretically, we show that under some assumptions, the bias introduced by the variational posterior can be bounded by the square root of the KL divergence or of the Fisher divergence (Johnson, 2004) between the variational posterior and the true one (see Theorems 1, 2, 3 and 4).

VaES and VaGES are applicable to a wide range of SF-based methods, including kernelized Stein discrepancy (KSD) (Liu et al., 2016) (see Sec. 3) and score matching (SM)-based methods (Vincent, 2011; Li et al., 2019) (see Sec. 4) for learning EBLVMs, as well as estimating the exact Fisher divergence (Hu et al., 2018; Grathwohl et al., 2020b) between the data distribution and the model distribution for evaluating EBLVMs (see Sec. 5). In particular, VaES and VaGES applied to SM-based methods are superior in time and memory consumption compared with the complex gradient unrolling in the strongest baseline BiSM (see Sec. 4.1), and they are applicable to learning deep EBLVMs on the MNIST (LeCun et al., 2010), CIFAR10 (Krizhevsky et al., 2009) and CelebA (Liu et al., 2015) datasets (see Sec. 4.2). We also present latent-space interpolation results of deep EBLVMs (see Sec. 4.2), which to our knowledge have not been investigated in previous EBLVMs.
2. Method

As mentioned in Sec. 1, the score function has been used in a wide range of methods (Hyvärinen, 2005; Liu et al., 2016; Grathwohl et al., 2020b), while it is generally intractable in EBLVMs without a strong structural assumption. In this paper, we present estimates of the score function and its gradient w.r.t. the model parameters in a general EBLVM. Formally, an EBLVM defines a joint probability distribution over the visible variables v and latent variables h as follows:

$$p_\theta(v, h) = \tilde{p}_\theta(v, h)/Z(\theta) = e^{-E_\theta(v, h)}/Z(\theta), \tag{1}$$

where Eθ(v, h) is the energy function parameterized by θ, p̃θ(v, h) = e^{−Eθ(v, h)} is the unnormalized distribution and Z(θ) = ∫ e^{−Eθ(v, h)} dv dh is the partition function. As shown by Vértes et al. (2016), the score function of an EBLVM can be written as an expectation over the posterior:

$$\nabla_v \log p_\theta(v) = \mathbb{E}_{p_\theta(h|v)}\left[\nabla_v \log \tilde{p}_\theta(v, h)\right], \tag{2}$$

where pθ(v) = ∫ pθ(v, h) dh is the marginal distribution and pθ(h|v) is the posterior. We then show that the gradient of the score function w.r.t. the model parameters can be decomposed into an expectation term and a covariance term over the posterior (proof in Appendix A):

$$\frac{\partial \nabla_v \log p_\theta(v)}{\partial \theta}
= \underbrace{\mathbb{E}_{p_\theta(h|v)}\left[\frac{\partial \nabla_v \log \tilde{p}_\theta(v, h)}{\partial \theta}\right]}_{\text{denoted as } e(v;\theta)}
+ \underbrace{\mathrm{Cov}_{p_\theta(h|v)}\left(\nabla_v \log \tilde{p}_\theta(v, h),\, \nabla_\theta \log \tilde{p}_\theta(v, h)\right)}_{\text{denoted as } c(v;\theta)}, \tag{3}$$

where the first term is the expectation of the random matrix ∂∇v log p̃θ(v, h)/∂θ with h ∼ pθ(h|v), and the second term is the covariance between the two random vectors ∇v log p̃θ(v, h) and ∇θ log p̃θ(v, h) with h ∼ pθ(h|v).

2.1. Variational (Gradient) Estimate of the Score Function in EBLVMs

Eq. (2) and (3) naturally suggest Monte Carlo estimates, but both of them require samples from the posterior pθ(h|v), which is generally intractable. For approximate inference, we present an amortized variational approach (Kingma & Welling, 2014) that considers

$$\mathbb{E}_{q_\phi(h|v)}\left[\nabla_v \log \tilde{p}_\theta(v, h)\right] \tag{4}$$

and

$$\mathbb{E}_{q_\phi(h|v)}\left[\frac{\partial \nabla_v \log \tilde{p}_\theta(v, h)}{\partial \theta}\right]
+ \mathrm{Cov}_{q_\phi(h|v)}\left(\nabla_v \log \tilde{p}_\theta(v, h),\, \nabla_\theta \log \tilde{p}_\theta(v, h)\right), \tag{5}$$

respectively, where qφ(h|v) is a variational posterior parameterized by φ. Note that the covariance term in Eq. (5) is a variational estimate of that in Eq. (3); it is not derived by taking the gradient of Eq. (4). We optimize some divergence between qφ(h|v) and pθ(h|v):

$$\min_\phi \; \mathbb{E}_{p_D(v)}\left[\mathcal{D}\left(q_\phi(h|v) \,\|\, p_\theta(h|v)\right)\right], \tag{6}$$

where pD(v) denotes the data distribution. Specifically, if we set D in Eq. (6) to the KL divergence or the Fisher divergence, then Eq. (6) is tractable. For clarity, we refer the readers to Appendix B for a derivation and a general analysis of the tractability of other divergences.
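To see why Eq. (6) is tractable for these two choices of D, the following standard identities, written in the notation of Eq. (1), are a condensed sketch (the paper's Appendix B gives the full treatment): for the KL divergence,

$$\mathrm{KL}\left(q_\phi(h|v) \,\|\, p_\theta(h|v)\right) = \mathbb{E}_{q_\phi(h|v)}\left[\log q_\phi(h|v) - \log \tilde{p}_\theta(v, h)\right] + \log \int \tilde{p}_\theta(v, h)\, \mathrm{d}h,$$

where the last term does not depend on φ, so minimizing over φ only requires the tractable ELBO-style expectation; and for continuous h,

$$D_F\left(q_\phi(h|v) \,\|\, p_\theta(h|v)\right) = \frac{1}{2}\, \mathbb{E}_{q_\phi(h|v)}\left\| \nabla_h \log q_\phi(h|v) - \nabla_h \log \tilde{p}_\theta(v, h) \right\|_2^2,$$

since ∇h log pθ(h|v) = ∇h log p̃θ(v, h) (the normalizer of the posterior does not depend on h). In both cases neither the partition function nor the true posterior is needed.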

Below, we consider Monte Carlo estimates based on the variational approximation. According to Eq. (4), the variational estimate of the score function (VaES) is

$$\mathrm{VaES}(v; \theta, \phi) = \frac{1}{L} \sum_{i=1}^{L} \nabla_v \log \tilde{p}_\theta(v, h_i), \tag{7}$$

where hi ∼ qφ(h|v) i.i.d. and L is the number of samples from qφ(h|v). Similarly, according to Eq. (5), the variational estimate of the expectation term e(v; θ) is

$$\hat{e}(v; \theta, \phi) = \frac{1}{L} \sum_{i=1}^{L} \frac{\partial \nabla_v \log \tilde{p}_\theta(v, h_i)}{\partial \theta}, \tag{8}$$

where hi ∼ qφ(h|v) i.i.d. As for the covariance term c(v; θ), we estimate it with the sample covariance matrix (Fan et al., 2016). According to Eq. (5), the variational estimate is

$$\hat{c}(v; \theta, \phi) = \frac{1}{L-1} \sum_{i=1}^{L} \nabla_v \log \tilde{p}_\theta(v, h_i)\, \frac{\partial \log \tilde{p}_\theta(v, h_i)}{\partial \theta}
- \frac{1}{(L-1)L} \left(\sum_{i=1}^{L} \nabla_v \log \tilde{p}_\theta(v, h_i)\right) \left(\sum_{i=1}^{L} \frac{\partial \log \tilde{p}_\theta(v, h_i)}{\partial \theta}\right), \tag{9}$$

where hi ∼ qφ(h|v) i.i.d., ∇v· outputs column vectors and ∂·/∂θ outputs row vectors. With Eq. (8) and (9), the variational gradient estimate of the score function (VaGES) is

$$\mathrm{VaGES}(v; \theta, \phi) = \hat{e}(v; \theta, \phi) + \hat{c}(v; \theta, \phi). \tag{10}$$

Remark: Although Eq. (8) involves second derivatives, in practice we only need products of these derivatives with vectors. Similar to Hessian-vector products (Song et al., 2019), only two backpropagations are required in the calculation, since z⊤ ∂∇v log p̃θ(v, hi)/∂θ = ∂(z⊤ ∇v log p̃θ(v, hi))/∂θ.
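To make Eq. (7)–(10) and the remark concrete, below is a minimal PyTorch sketch (not the authors' released implementation). `energy(v, h)` and `q_posterior(v, num_samples=L)` are assumed interfaces, and a single example v is used for clarity. Following the Hessian-vector-product remark, z⊤VaGES(v; θ, φ) for a fixed vector z is computed as the θ-gradient of a scalar surrogate, so second derivatives are never materialized.

```python
import torch

def log_p_tilde(energy, v, h):
    # log p̃_θ(v, h) = -E_θ(v, h); `energy` is any differentiable energy module.
    return -energy(v, h)

def score_v(energy, v, h):
    # ∇_v log p̃_θ(v, h), kept differentiable w.r.t. θ via create_graph=True.
    v = v.detach().requires_grad_(True)
    return torch.autograd.grad(log_p_tilde(energy, v, h).sum(), v, create_graph=True)[0]

def vaes(energy, q_posterior, v, L=5):
    # Eq. (7): average of ∇_v log p̃_θ(v, h_i) over h_i ~ q_φ(h|v).
    h = q_posterior(v, num_samples=L)                     # assumed sampler, shape (L, dim_h)
    return torch.stack([score_v(energy, v, h[i]) for i in range(L)]).mean(dim=0)

def z_dot_vages_surrogate(energy, q_posterior, v, z, L=5):
    """Scalar whose θ-gradient equals z^T VaGES(v; θ, φ) for a fixed vector z.

    z^T ê is obtained via ∂(z^T ∇_v log p̃)/∂θ (the Hessian-vector-product remark),
    and z^T ĉ via the sample-covariance identity behind Eq. (9); the h_i are treated
    as fixed samples, since φ is trained separately through Eq. (6).
    """
    z = z.detach()
    h = q_posterior(v, num_samples=L)
    s = torch.stack([score_v(energy, v, h[i]) for i in range(L)])          # (L, dim_v)
    logp = torch.stack([log_p_tilde(energy, v, h[i]) for i in range(L)])   # (L,)
    e_term = (z * s.mean(dim=0)).sum()                          # grad_θ -> z^T ê
    centered = (s - s.mean(dim=0, keepdim=True)).detach()
    c_term = ((centered * z).sum(dim=1) * logp).sum() / (L - 1)  # grad_θ -> z^T ĉ
    return e_term + c_term
```

In the objectives of Sec. 3–5, z is the detached coefficient that multiplies VaGES (e.g., k(v, v′)VaES(v; θ, φ) + ∇v k(v, v′) in Eq. (14)), so backpropagating this surrogate together with the terms that use VaES directly yields the variational stochastic gradient w.r.t. θ.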

[Figure 1. The biases of VaES and VaGES v.s. the square root of (a) the KL divergence between qφ(h|v) and pθ(h|v) in a GRBM (discrete h) and (b) the Fisher divergence between qφ(h|v) and pθ(h|v) in a Gaussian model (GM, continuous h). See Appendix F.4 for experimental details.]

2.2. Bounding the Bias

Notice that VaES(v; θ, φ) and VaGES(v; θ, φ) are actually biased estimates due to the difference between qφ(h|v) and pθ(h|v). Firstly, we show that under some boundedness assumptions, the bias of VaES(v; θ, φ) and VaGES(v; θ, φ) can be bounded by the square root of the KL divergence between qφ(h|v) and pθ(h|v), as characterized in Theorem 1 and Theorem 2.

Theorem 1. (VaES, KL, proof in Appendix C) Suppose ∇v log p̃θ(v, h) is bounded w.r.t. v, h and θ. Then the bias of VaES(v; θ, φ) can be bounded by the square root of the KL divergence between qφ(h|v) and pθ(h|v), up to a multiplicative constant.

Theorem 2. (VaGES, KL, proof in Appendix C) Suppose ∇v log p̃θ(v, h), ∇θ log p̃θ(v, h) and ∂∇v log p̃θ(v, h)/∂θ are bounded w.r.t. v, h and θ. Then the bias of VaGES(v; θ, φ) can be bounded by the square root of the KL divergence between qφ(h|v) and pθ(h|v), up to a multiplicative constant.

Remark: The boundedness assumptions in Theorems 1 and 2 are not difficult to satisfy when h is discrete. For example, if v comes from a compact set (which is naturally satisfied on image data), θ comes from a compact set (which can be satisfied by adding weight decay to θ), log p̃θ(v, h) is twice continuously differentiable w.r.t. v and θ (which is satisfied in neural networks composed of smooth functions) and h comes from a finite set (which is satisfied in most commonly used discrete distributions, e.g., the Bernoulli distribution), then ∇v log p̃θ(v, h), ∇θ log p̃θ(v, h) and ∂∇v log p̃θ(v, h)/∂θ are bounded w.r.t. v, h and θ by the extreme value theorem for continuous functions on compact sets.

In Fig. 1(a), we numerically validate Theorems 1 and 2 in Gaussian restricted Boltzmann machines (GRBMs) (Hinton & Salakhutdinov, 2006). The biases of VaES and VaGES in such models are linearly proportional to the square root of the KL divergence. Theorems 1 and 2 strongly motivate us to set D in Eq. (6) to the KL divergence when h is discrete.

Then, we show that under extra assumptions on the Stein regularity (see Def. 2) and the boundedness of the Stein factors (see Def. 3), the bias of VaES(v; θ, φ) and VaGES(v; θ, φ) can be bounded by the square root of the Fisher divergence (Johnson, 2004) between qφ(h|v) and pθ(h|v), as characterized in Theorem 3 and Theorem 4. We first introduce some definitions, which will be used in the theorems.

Definition 1. (Ley et al., 2013) Suppose p is a probability density defined on R^n and f : R^n → R is a function. We define g^p_f as a solution of the Stein equation S_p g = f − E_p f, where g : R^n → R^n and S_p g(x) ≜ ∇x log p(x)⊤ g(x) + Tr(∇x g(x)). (See Appendix D for the existence of the solution.)

Definition 2. Suppose p, q are probability densities defined on R^n and f : R^n → R^m is a function. We say f satisfies the Stein regular condition w.r.t. p, q iff ∀i ∈ Z ∩ [1, m], lim_{||x||→∞} q(x) g^p_{f_i}(x) = 0.

Definition 3. (Ley et al., 2013) Suppose p, q are probability densities defined on R^n and f : R^n → R^m is a function satisfying the Stein regular condition w.r.t. p, q. We define

$$\kappa^{p,q}_f \triangleq \sqrt{\sum_{i=1}^{m} \mathbb{E}_{q(x)} \left\| g^p_{f_i}(x) \right\|_2^2},$$

referred to as the Stein factor of f w.r.t. p, q.

Theorem 3. (VaES, Fisher, proof in Appendix D) Suppose (1) ∀(v, θ, φ), ∇v log p̃θ(v, h) as a function of h satisfies the Stein regular condition w.r.t. pθ(h|v) and qφ(h|v), and (2) the Stein factor of ∇v log p̃θ(v, h) as a function of h w.r.t. pθ(h|v), qφ(h|v) is bounded w.r.t. v, θ and φ. Then the bias of VaES(v; θ, φ) can be bounded by the square root of the Fisher divergence between qφ(h|v) and pθ(h|v), up to a multiplicative constant.

Theorem 4. (VaGES, Fisher, proof in Appendix D) Suppose (1) ∀(v, θ, φ), ∇v log p̃θ(v, h), ∇θ log p̃θ(v, h), ∇v log p̃θ(v, h) ∂log p̃θ(v, h)/∂θ and ∂∇v log p̃θ(v, h)/∂θ as functions of h satisfy the Stein regular condition w.r.t. pθ(h|v) and qφ(h|v), (2) the Stein factors of ∇v log p̃θ(v, h), ∇θ log p̃θ(v, h), ∇v log p̃θ(v, h) ∂log p̃θ(v, h)/∂θ and ∂∇v log p̃θ(v, h)/∂θ as functions of h w.r.t. pθ(h|v), qφ(h|v) are bounded w.r.t. v, θ and φ, and (3) ∇v log p̃θ(v, h) and ∇θ log p̃θ(v, h) are bounded w.r.t. v, h and θ. Then the bias of VaGES(v; θ, φ) can be bounded by the square root of the Fisher divergence between qφ(h|v) and pθ(h|v), up to a multiplicative constant.

Although the boundedness of the Stein factors has only been verified in some simple cases (Ley et al., 2013) and has not been extended to more complex cases, e.g., when p̃θ(v, h) is parameterized by a neural network, we find it works in practice when choosing D in Eq. (6) as the Fisher divergence (see Sec. 4.2) to learn qφ(h|v). In Fig. 1(b), we numerically validate Theorems 3 and 4 in a Gaussian model (GM), whose energy is

$$E_\theta(v, h) = \frac{1}{2\sigma^2}\|v - b\|^2 + \frac{1}{2}\|h - c\|^2 - v^\top W h,$$

with θ = (σ, W, b, c). The biases of VaES and VaGES in such models are linearly proportional to the square root of the Fisher divergence.

2.3. Langevin Dynamics Corrector

Sometimes pθ(h|v) can be too complex for qφ(h|v) to approximate, especially in deep models. To improve the inference performance, we can run a few Langevin dynamics steps (Welling & Teh, 2011) to correct samples from qφ(h|v), which has been successfully applied to correct the solution of a numerical SDE solver (Song et al., 2020). Specifically, a Langevin dynamics step updates h by

$$h \leftarrow h + \frac{\alpha}{2} \underbrace{\nabla_h \log p_\theta(h|v)}_{=\,\nabla_h \log \tilde{p}_\theta(v, h)} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \alpha). \tag{11}$$

In practice, the standard deviation of ε is often smaller than √α to allow faster convergence (Du & Mordatch, 2019; Nijkamp et al., 2020; 2019; Grathwohl et al., 2020a). Since Langevin dynamics requires pθ(h|v) to be differentiable w.r.t. h, we only apply the corrector when h is continuous.
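A minimal sketch of the corrector in Eq. (11), assuming a differentiable `energy(v, h)`; following the practical note above, the noise scale is exposed separately from the step size rather than fixed to √α.

```python
import torch

def langevin_correct(energy, v, h, n_steps=5, step_size=1e-4, noise_std=1e-4):
    """Refine h ~ q_φ(h|v) with a few Langevin steps targeting p_θ(h|v).

    Uses ∇_h log p_θ(h|v) = ∇_h log p̃_θ(v, h) = -∇_h E_θ(v, h), so the
    partition function is never needed. Only valid for continuous h.
    """
    v = v.detach()
    for _ in range(n_steps):
        h = h.detach().requires_grad_(True)
        log_p_tilde = -energy(v, h).sum()
        grad_h = torch.autograd.grad(log_p_tilde, h)[0]
        h = h + 0.5 * step_size * grad_h + noise_std * torch.randn_like(h)
    return h.detach()
```

In the deep-EBLVM experiments of Sec. 4.2, the ratio of the step size to the dimension of h is fixed per dataset, as described there.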
3. Learning EBLVMs with KSD

In this section, we show that VaES and VaGES can extend the kernelized Stein discrepancy (KSD) (Liu et al., 2016) to learn general EBLVMs. The KSD between the data distribution pD(v) and the model distribution pθ(v) with the kernel k(v, v′) is defined as

$$\begin{aligned}
\mathrm{KSD}(p_D, p_\theta) \triangleq \mathbb{E}_{v, v' \sim p_D} \big[ & \nabla_v \log p_\theta(v)^\top k(v, v')\, \nabla_{v'} \log p_\theta(v') \\
& + \nabla_v \log p_\theta(v)^\top \nabla_{v'} k(v, v') \\
& + \nabla_v k(v, v')^\top \nabla_{v'} \log p_\theta(v') \\
& + \mathrm{Tr}\left(\nabla_v \nabla_{v'} k(v, v')\right) \big],
\end{aligned} \tag{12}$$

which properly measures the difference between pD(v) and pθ(v) under some mild assumptions (Liu et al., 2016). To learn an EBLVM, we use gradient-based optimization to minimize KSD(pD, pθ), where the gradient w.r.t. θ is

$$\frac{\partial\, \mathrm{KSD}(p_D, p_\theta)}{\partial \theta} = 2\, \mathbb{E}_{v, v' \sim p_D} \left[ \left( k(v, v')\, \nabla_v \log p_\theta(v) + \nabla_v k(v, v') \right)^\top \frac{\partial \nabla_{v'} \log p_\theta(v')}{\partial \theta} \right]. \tag{13}$$

Estimating ∇v log pθ(v) and ∂∇v′ log pθ(v′)/∂θ with VaES and VaGES respectively, the variational stochastic gradient estimate of KSD (VaGES-KSD) is

$$\frac{2}{M} \sum_{i=1}^{M} \left[ k(v_i, v_i')\, \mathrm{VaES}(v_i; \theta, \phi) + \nabla_v k(v_i, v_i') \right]^\top \mathrm{VaGES}(v_i'; \theta, \phi), \tag{14}$$

where the union of v1:M and v′1:M is a mini-batch from the data distribution, and VaES(vi; θ, φ) and VaGES(v′i; θ, φ) are independent.

3.1. Experiments

Setting. Since KSD itself suffers from the curse of dimensionality (Gong et al., 2020), we illustrate the validity of VaGES-KSD by learning Gaussian restricted Boltzmann machines (GRBMs) (Welling et al., 2004; Hinton & Salakhutdinov, 2006) on the 2-D checkerboard dataset, whose distribution is shown in Appendix G.1. The energy function of a GRBM is

$$E_\theta(v, h) = \frac{1}{2\sigma^2}\|v - b\|^2 - c^\top h - v^\top W h, \tag{15}$$

with learnable parameters θ = (σ, W, b, c). Since the GRBM has a tractable posterior, we can directly learn it using KSD (see Eq. (12)). We also compare with another baseline where p̃θ(v) is estimated by importance sampling (Owen, 2013), i.e., p̃θ(v) ≈ (1/L) Σ_{i=1}^{L} p̃θ(v, hi)/unif(hi) with hi ∼ unif(h) i.i.d., where unif denotes the uniform distribution; we call this baseline IS-KSD. We use the RBF kernel k(v, v′) = exp(−||v − v′||²₂ / (2σ²)) with σ = 0.1. qφ(h|v) is a Bernoulli distribution parameterized by a fully connected layer with the sigmoid activation, and we use the Gumbel-Softmax trick (Jang et al., 2017) for reparameterization of qφ(h|v) with temperature 0.1. D in Eq. (6) is the KL divergence. See Appendix F.1 for more experimental details.
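A minimal sketch of the variational posterior used in this setting: a linear layer with sigmoid output and a Gumbel-Softmax / relaxed-Bernoulli reparameterization at temperature 0.1. Layer sizes and names are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class BernoulliPosterior(nn.Module):
    """q_φ(h|v): factorized Bernoulli with probabilities sigmoid(W v + b)."""

    def __init__(self, dim_v, dim_h, temperature=0.1):
        super().__init__()
        self.logits = nn.Linear(dim_v, dim_h)
        self.temperature = temperature

    def forward(self, v, num_samples=1):
        logits = self.logits(v)                      # (batch, dim_h)
        # The relaxed Bernoulli (Gumbel-Softmax for binary variables) gives
        # reparameterized "soft" samples in (0, 1); at temperature 0.1 they
        # are close to binary.
        dist = torch.distributions.RelaxedBernoulli(
            temperature=self.temperature, logits=logits)
        return dist.rsample((num_samples,))          # (num_samples, batch, dim_h)
```

Training φ with the KL case of Eq. (6) additionally uses the (relaxed) log-probability of these samples together with log p̃θ(v, h).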

[Figure 2. Comparison of KSD, VaGES-KSD and IS-KSD on checkerboard. The test log-likelihood is averaged over 10 runs.]

Result. As shown in Fig. 2, VaGES-KSD outperforms IS-KSD for all values of L (the number of samples from qφ(h|v)) and is comparable to KSD (the test log-likelihood curves of VaGES-KSD are very close to those of KSD when L = 5, 10).

4. Learning EBLVMs with Score Matching

In this section, we show that VaES and VaGES can extend two score matching (SM)-based methods (Vincent, 2011; Li et al., 2019) to learn general EBLVMs. The first one is denoising score matching (DSM) (Vincent, 2011), which minimizes the Fisher divergence DF between a perturbed data distribution pσ0(v) and the model distribution pθ(v):

$$\mathcal{J}_{DSM}(\theta) \triangleq D_F(p_{\sigma_0} \| p_\theta) \equiv \frac{1}{2}\, \mathbb{E}_{p_D(w)\, p_{\sigma_0}(v|w)} \left\| \nabla_v \log p_\theta(v) - \nabla_v \log p_{\sigma_0}(v|w) \right\|_2^2, \tag{16}$$

where pD(w) is the data distribution, pσ0(v|w) = N(v|w, σ0² I) is a Gaussian perturbation with a fixed noise level σ0, pσ0(v) = ∫ pD(w) pσ0(v|w) dw is the perturbed data distribution, and ≡ denotes equivalence in parameter optimization. The second one is multiscale denoising score matching (MDSM) (Li et al., 2019), which uses different noise levels to scale up DSM to high-dimensional data:

$$\mathcal{J}_{MDSM}(\theta) \triangleq \frac{1}{2}\, \mathbb{E}_{p_D(w)\, p(\sigma)\, p_\sigma(v|w)} \left\| \nabla_v \log p_\theta(v) - \nabla_v \log p_{\sigma_0}(v|w) \right\|_2^2, \tag{17}$$

where p(σ) is a prior distribution over the flexible noise level σ and σ0 is a fixed noise level. We extend DSM and MDSM to learn EBLVMs. Firstly, we write Eq. (16) and (17) in a general form for the convenience of discussion:

$$\mathcal{J}(\theta) = \frac{1}{2}\, \mathbb{E}_{p_D(w, v)} \left\| \nabla_v \log p_\theta(v) - \nabla_v \log p_{\sigma_0}(v|w) \right\|_2^2, \tag{18}$$

where pD(w, v) is the joint distribution of w and v (specifically, pD(w, v) = pD(w) pσ0(v|w) for Eq. (16) and pD(w, v) = ∫ pD(w) p(σ) pσ(v|w) dσ for Eq. (17)).
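Since pσ0(v|w) is Gaussian, the target score in Eq. (16)–(18) has a simple closed form (a standard identity, not stated explicitly above):

$$\nabla_v \log p_{\sigma_0}(v|w) = \nabla_v \left( -\frac{\|v - w\|_2^2}{2\sigma_0^2} \right) = \frac{w - v}{\sigma_0^2},$$

so for a noisy sample v = w + σ0 ε with ε ∼ N(0, I), the regression target is simply −ε/σ0.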

We use gradient-based optimization to minimize J(θ), and its gradient w.r.t. θ is

$$\frac{\partial \mathcal{J}(\theta)}{\partial \theta} = \mathbb{E}_{p_D(w, v)} \Big\{ \left( \nabla_v \log p_\theta(v) - \nabla_v \log p_{\sigma_0}(v|w) \right)^\top \frac{\partial \nabla_v \log p_\theta(v)}{\partial \theta} \Big\}. \tag{19}$$

Estimating ∇v log pθ(v) and ∂∇v log pθ(v)/∂θ with VaES and VaGES respectively, the variational stochastic gradient estimate of the SM-based methods (VaGES-SM) is

$$\frac{1}{M} \sum_{i=1}^{M} \left( \mathrm{VaES}(v_i; \theta, \phi) - \nabla_v \log p_{\sigma_0}(v_i|w_i) \right)^\top \mathrm{VaGES}(v_i; \theta, \phi), \tag{20}$$

where (w1:M, v1:M) is a mini-batch from pD(w, v), and VaES(vi; θ, φ) and VaGES(vi; θ, φ) are independent. We explicitly denote our methods as VaGES-DSM or VaGES-MDSM according to which objective is used.
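A sketch of one VaGES-DSM update realizing Eq. (20) with a scalar surrogate loss (the same surrogate idea as the sketch in Sec. 2.1). `energy(v, h)` returning per-example energies and `q_posterior(v, L)` returning posterior samples are assumed interfaces, not the authors' released API, and φ is assumed to be updated separately via Eq. (6).

```python
import torch

def vages_dsm_step(energy, q_posterior, optimizer, w, sigma0=0.1, L=5):
    """One VaGES-DSM update on a mini-batch w of clean data (sketch of Eq. (20))."""
    v = w + sigma0 * torch.randn_like(w)             # v ~ p_{σ0}(v|w)
    target = (w - v) / sigma0 ** 2                   # ∇_v log p_{σ0}(v|w), closed form

    h = q_posterior(v, L).detach()                   # h_i ~ q_φ(h|v), fixed during the θ update
    v = v.detach().requires_grad_(True)

    scores, logps = [], []
    for i in range(L):
        logp_i = -energy(v, h[i])                    # (batch,)  log p̃_θ(v, h_i)
        s_i = torch.autograd.grad(logp_i.sum(), v, create_graph=True)[0]
        scores.append(s_i)                           # (batch, dim_v), differentiable w.r.t. θ
        logps.append(logp_i)
    scores = torch.stack(scores)                     # (L, batch, dim_v)
    logps = torch.stack(logps)                       # (L, batch)

    vaes = scores.mean(dim=0)                        # Eq. (7)
    coeff = (vaes - target).detach()                 # detached coefficient in Eq. (20)

    # Surrogate whose θ-gradient is Eq. (20): coeff^T ê via the HVP trick,
    # coeff^T ĉ via the sample-covariance identity of Eq. (9).
    e_surr = (coeff * vaes).sum(dim=1)
    centered = (scores - scores.mean(dim=0, keepdim=True)).detach()
    c_surr = ((centered * coeff).sum(dim=2) * logps).sum(dim=0) / (L - 1)
    loss = (e_surr + c_surr).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

For VaGES-MDSM, the only change under this reading is that σ is drawn from p(σ) per example while the target still uses the fixed σ0, as in Eq. (17).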

4.1. Comparison in GRBMs

Setting. We first consider the GRBM (see Eq. (15)), which is a good benchmark for comparing existing methods. We consider the Frey face dataset, which consists of gray-scaled face images of size 20 × 28, and the checkerboard dataset (mentioned in Sec. 3.1). qφ(h|v) is a Bernoulli distribution parameterized by a fully connected layer with the sigmoid activation, and we use the Gumbel-Softmax trick (Jang et al., 2017) for reparameterization of qφ(h|v) with temperature 0.1. D in Eq. (6) is the KL divergence. See Appendix F.2.1 for more experimental details.

[Figure 3. Comparing VaGES-DSM and BiDSM on Frey face: (a) test log-likelihood ↑, (b) test SM loss ↓, (c) time (s) ↓, (d) memory (MB) ↓. The test SM loss is the Fisher divergence up to an additive constant.]

Comparison with BiSM. As mentioned in Sec. 1, bi-level score matching (BiSM) (Bao et al., 2020) learns general EBLVMs by reformulating Eq. (16) and (17) as bi-level optimization problems, referred to as BiDSM and BiMDSM respectively, and it serves as the most direct baseline (see Appendix I for an introduction to BiSM). We make a comprehensive comparison between VaGES-SM and BiSM on the Frey face dataset, which explores all hyperparameters involved in these two methods. Specifically, both methods sample from a variational posterior and optimize the variational posterior, so they share two hyperparameters: L (the number of h sampled from the variational posterior) and K (the number of times the variational posterior is updated on each mini-batch). Besides, BiSM has an extra hyperparameter N, which specifies the number of gradient unrolling steps. In Fig. 3, we plot the test log-likelihood, the test score matching (SM) loss (Hyvärinen, 2005), which is the Fisher divergence up to an additive constant, and the time and memory consumption, by grid search across L, K and N. VaGES-DSM takes about 40% of the time and 90% of the memory on average to achieve a performance similar to BiDSM (N=5). VaGES-DSM outperforms BiDSM (N=0, 2), and meanwhile it takes less time than BiDSM (N=2) and the least memory.

Comparison with other methods. For completeness, we also compare with DSM (Vincent, 2011), CD-based methods (Hinton, 2002; Tieleman, 2008) and noise contrastive estimation (NCE)-based methods (Gutmann & Hyvärinen, 2010; Rhodes & Gutmann, 2019) on the checkerboard dataset. VaGES-DSM is competitive with the baselines (e.g., CD and DSM) that leverage the tractability of the posterior in the GRBM, and it outperforms variational NCE (Rhodes & Gutmann, 2019). See results in Appendix G.2.1.

4.2. Learning Deep EBLVMs

Setting. We now show that VaGES-SM can scale up to learn general deep EBLVMs and generate natural images. Following BiSM (Bao et al., 2020), we consider the deep EBLVM with the following energy function:

$$E_\theta(v, h) = g_3(g_2(g_1(v; \theta_1), h); \theta_2), \tag{21}$$

where h is continuous, θ = (θ1, θ2), g1 is a vector-valued neural network that outputs a feature sharing the same dimension as h, g2 is an additive coupling layer (Dinh et al., 2015) and g3 is a scalar-valued neural network. We consider the MNIST dataset (LeCun et al., 2010), which consists of gray-scaled hand-written digits of size 28 × 28, the CIFAR10 dataset (Krizhevsky et al., 2009), which consists of color natural images of size 32 × 32, and the CelebA dataset (Liu et al., 2015), which consists of color face images; we resize CelebA to 64 × 64. qφ(h|v) is a Gaussian distribution parameterized by a 3-layer convolutional neural network. D in Eq. (6) is the Fisher divergence. We use the Langevin dynamics corrector (see Sec. 2.3) to correct samples from qφ(h|v). As for the step size α in Langevin dynamics, we empirically find that the optimal one is approximately proportional to the dimension of h (see Appendix F.2.2), perhaps because Langevin dynamics converges to its stationary distribution more slowly when h has a higher dimension and thereby requires a larger step size; therefore we fix the ratio of the step size α to the dimension of h on each dataset, and the ratio is 2 × 10⁻⁴ on MNIST and CIFAR10 and 10⁻⁵ on CelebA. The standard deviation of the noise ε in Langevin dynamics is 10⁻⁴. See Appendix F.2.2 for more experimental details, e.g., how hyperparameters are selected, and the time consumption.
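A minimal sketch of an energy with the structure of Eq. (21). The coupling layer here follows the NICE style (one stream shifted by a function of the other), which is an assumption on our part; the released architecture of Bao et al. (2020) may differ in details, and all module sizes are illustrative.

```python
import torch
import torch.nn as nn

class DeepEBLVMEnergy(nn.Module):
    """E_θ(v, h) = g3(g2(g1(v; θ1), h); θ2), cf. Eq. (21). Sketch only."""

    def __init__(self, dim_h=50):
        super().__init__()
        # g1: image -> feature with the same dimension as h (parameters θ1).
        self.g1 = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Flatten(), nn.LazyLinear(dim_h))
        # m: the coupling network inside the additive coupling layer g2.
        self.m = nn.Sequential(nn.Linear(dim_h, 256), nn.SiLU(), nn.Linear(256, dim_h))
        # g3: scalar-valued head (parameters θ2).
        self.g3 = nn.Sequential(nn.Linear(2 * dim_h, 256), nn.SiLU(), nn.Linear(256, 1))

    def forward(self, v, h):
        f = self.g1(v)                        # g1(v; θ1)
        # Additive coupling, assumed as (f, h + m(f)).
        coupled = torch.cat([f, h + self.m(f)], dim=-1)
        return self.g3(coupled).squeeze(-1)   # per-example energy
```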
Table 1. FID on CIFAR10 and CelebA (64 × 64). † Averaged over 5 runs. ‡ Since BiSM does not report an FID on CelebA, the value is evaluated in our reproduction.

(a) CIFAR10
Methods                          FID ↓
Flow-CE (Gao et al., 2020)       37.30
VAE-EBLVM (Han et al., 2020)     30.1
CoopNets (Xie et al., 2018)      33.61
EBM (Du & Mordatch, 2019)        38.2
MDSM (Li et al., 2019)           31.7
BiMDSM (Bao et al., 2020)        29.43 ± 2.76†
VaGES-MDSM (ours)                28.93 ± 1.91†

(b) CelebA
Methods                          FID ↓
BiMDSM (Bao et al., 2020)        32.43‡
VaGES-MDSM (ours)                31.53

Sample quality. Since the EBLVM defined by Eq. (21) has an intractable posterior, BiSM is the only applicable baseline among those mentioned in Section 4.1. We quantitatively evaluate the sample quality with the FID score (Heusel et al., 2017) in Tab. 1, and VaGES-MDSM achieves performance comparable to BiMDSM. Since there are relatively few baselines for learning deep EBLVMs, we also compare with other baselines that learn EBMs in Tab. 1. We note that VAE-EBLVM (Han et al., 2020) and CoopNets (Xie et al., 2018) report FID results on a subset of CelebA and on a different resolution of CelebA, respectively; for fairness, we do not include these results on CelebA. We provide image samples and Inception Score results in Appendix G.2.2.

[Figure 4. Interpolation of annealed Langevin dynamics trajectories in the latent space on (a) MNIST, (b) CIFAR10 and (c) CelebA. Each row displays a trajectory of pθ(v|h) for a fixed latent variable h, and h changes smoothly between rows. Different trajectories share the same initial point.]

Interpolation in the latent space. As argued in prior work (Bao et al., 2020), the conditional distribution pθ(v|h) of a deep EBLVM can be multimodal, so we want to explore how a local mode of pθ(v|h) evolves as h changes. Since v is high-dimensional, we are not able to plot the landscape of pθ(v|h). Alternatively, we study how a fixed particle moves in different landscapes according to annealed Langevin dynamics (Li et al., 2019) by interpolating h. Specifically, we select two points in the latent space and interpolate linearly between them to get a class of conditional distributions {pθ(v|hi)}^T_{i=1}. Then we fix an initial point v0 and run annealed Langevin dynamics on these conditional distributions with a shared noise. These annealed Langevin dynamics trajectories are shown in Fig. 4: starting from the same initial noise, the trajectory evolves smoothly as h changes smoothly. More interpolation results can be found in Appendix G.2.2.
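A rough sketch of this interpolation procedure under simplifying assumptions: a geometric step-size schedule stands in for the annealed Langevin dynamics of Li et al. (2019), and the same noise sequence and initial point are shared across interpolants. `energy` is the deep EBLVM energy.

```python
import torch

def latent_interpolation_trajectories(energy, h_a, h_b, v0, T=8, n_steps=200,
                                      step0=1e-2, decay=0.98):
    """For linearly interpolated latents h_t, run annealed Langevin dynamics on v
    targeting p_θ(v|h_t) ∝ exp(-E_θ(v, h_t)), sharing the initial point and the
    noise sequence across all trajectories (cf. Fig. 4)."""
    lams = torch.linspace(0, 1, T)
    noises = [torch.randn_like(v0) for _ in range(n_steps)]    # shared noise
    trajectories = []
    for lam in lams:
        h = (1 - lam) * h_a + lam * h_b                        # linear interpolation
        v, step = v0.clone(), step0
        for k in range(n_steps):
            v = v.detach().requires_grad_(True)
            grad_v = torch.autograd.grad(-energy(v, h).sum(), v)[0]
            v = v + 0.5 * step * grad_v + (step ** 0.5) * noises[k]
            step *= decay                                      # simple annealing schedule
        trajectories.append(v.detach())
    return torch.stack(trajectories)
```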

Sensitivity analysis. We also study how hyperparameters influence the performance of VaGES-SM in deep EBLVMs, including the dimension of h, the number of convolutional layers, the number of h sampled from qφ(h|v), the number of times φ is updated, the number of Langevin dynamics steps, the noise level in Langevin dynamics, and which divergence is used in Eq. (6). These results can be found in Appendix G.2.2. In brief, increasing the number of convolutional layers improves the performance, while the dimension of h, the number of h sampled from qφ(h|v), the noise level in Langevin dynamics, and using the KL divergence instead of the Fisher divergence in Eq. (6) do not affect the result very much. Besides, setting both the number of times φ is updated and the number of Langevin dynamics steps to 5 is enough for a stable training process.

5. Evaluating EBLVMs with Exact Fisher Divergence

In this setting, we are given an EBLVM and a set of samples {vi}^n_{i=1} from the data distribution pD(v). We want to measure how well the model distribution pθ(v) approximates pD(v), which requires an absolute value representing the difference between them. This task is more difficult than model comparison, which only needs relative values (e.g., the log-likelihood) to compare different models. Specifically, we consider the Fisher divergence in its maximum form (Hu et al., 2018; Grathwohl et al., 2020b) to measure how well pθ(v) approximates pD(v):

$$D_F(p_D \| p_\theta) \triangleq \max_{f \in \mathcal{F}} \; \mathbb{E}_{p_D(v)} \mathbb{E}_{p(\epsilon)} \left[ \nabla_v \log p_\theta(v)^\top f(v) + \epsilon^\top \nabla_v f(v)\, \epsilon - \frac{1}{2} \|f(v)\|_2^2 \right], \tag{22}$$

where F is the set of functions f : R^d → R^d satisfying lim_{||v||→∞} pD(v) f(v) = 0, and p(ε) is a noise distribution (e.g., a Gaussian distribution) introduced for computational efficiency (Grathwohl et al., 2020b). Such a form can also be understood as the Stein discrepancy (Gorham & Mackey, 2017) with an L2 constraint on the function space F, and it has been applied to evaluate GRBMs (Grathwohl et al., 2020b). Under some mild assumptions (see Appendix E), Eq. (22) is equal to the exact Fisher divergence. In practice, F is approximated by a neural network family {fη : η ∈ H}, where η is the parameter and H is the parameter space. We optimize the right-hand side of Eq. (22), and the gradient w.r.t. η is

$$\mathbb{E}_{p_D(v)} \mathbb{E}_{p(\epsilon)} \left[ \nabla_v \log p_\theta(v)^\top \frac{\partial f_\eta(v)}{\partial \eta} + \frac{\partial\, \epsilon^\top \nabla_v f_\eta(v)\, \epsilon}{\partial \eta} - \frac{1}{2} \frac{\partial \|f_\eta(v)\|_2^2}{\partial \eta} \right]. \tag{23}$$

Estimating ∇v log pθ(v) with VaES, the variational stochastic gradient estimate is

$$\frac{1}{M} \sum_{i=1}^{M} \left[ \mathrm{VaES}(v_i; \theta, \phi)^\top \frac{\partial f_\eta(v_i)}{\partial \eta} + \frac{\partial\, \epsilon_i^\top \nabla_v f_\eta(v_i)\, \epsilon_i}{\partial \eta} - \frac{1}{2} \frac{\partial \|f_\eta(v_i)\|_2^2}{\partial \eta} \right], \tag{24}$$

where v1:M is a mini-batch from pD(v) and ε1:M is a mini-batch from p(ε). We refer to this method as VaGES-Fisher.

5.1. Experiments

We validate the effectiveness of VaGES-Fisher in GRBMs. We initialize a GRBM as pD(v), perturb its weights with increasing noise, and estimate the Fisher divergence between the initial GRBM and the perturbed one by VaGES-Fisher. The dimensions of v and h are the same, and we experiment with dimensions 200 and 500. See Appendix F.3 for more experimental details. The result is shown in Fig. 5: we compare the Fisher divergence estimated by VaGES-Fisher with the accurate Fisher divergence, and under both dimensions our estimate is close to the accurate one.

[Figure 5. The Fisher divergence estimated by VaGES-Fisher v.s. the accurate Fisher divergence (Fisher).]
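A compact sketch of optimizing Eq. (24) over η. It assumes `vaes(v)` returns a (detached) VaES estimate of ∇v log pθ(v) and `f_eta` is a small vector-valued network; the ε⊤∇v f(v) ε term is computed with one extra backpropagation (a Hutchinson-style implementation, which is one reasonable choice rather than the paper's exact code).

```python
import torch

def vages_fisher_objective(f_eta, vaes, v):
    """Monte Carlo estimate of the inner objective of Eq. (22), with VaES plugged
    in for the model score (Eq. (24)); maximize this w.r.t. the parameters of f_eta."""
    v = v.detach().requires_grad_(True)
    eps = torch.randn_like(v)                     # ε ~ p(ε), here standard Gaussian
    f = f_eta(v)                                  # (batch, d)
    score = vaes(v).detach()                      # VaES(v; θ, φ), treated as a constant
    # ε^T ∇_v f(v) ε = ε^T ∇_v (f(v)·ε): one extra backpropagation.
    grad_v = torch.autograd.grad((f * eps).sum(), v, create_graph=True)[0]
    trace_term = (grad_v * eps).sum(dim=-1)
    objective = (score * f).sum(dim=-1) + trace_term - 0.5 * (f ** 2).sum(dim=-1)
    return objective.mean()

# Usage sketch: ascend the objective with an optimizer over f_eta's parameters,
# then report the converged value as the estimated Fisher divergence.
# opt = torch.optim.Adam(f_eta.parameters(), lr=1e-3)
# for v in data_loader:
#     loss = -vages_fisher_objective(f_eta, vaes, v)
#     opt.zero_grad(); loss.backward(); opt.step()
```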
6. Related Work

Learning EBLVMs with MLE. A popular class of learning methods is maximum likelihood estimation (MLE). Such methods have to deal with both the model posterior and the partition function. Some methods impose structural assumptions on the model. Among them, contrastive divergence (CD) (Hinton, 2002) and its variants (Tieleman, 2008; Qiu et al., 2020) estimate the gradient of the partition function with respect to the model parameters using Gibbs sampling (Geman & Geman, 1984) when the posterior can be sampled efficiently. Furthermore, when the model is fully visible, scalable methods (Du et al., 2018; Du & Mordatch, 2019; Nijkamp et al., 2020; 2019; Grathwohl et al., 2020a) are applicable by estimating such a gradient with Langevin dynamics. In deep models with multiple layers of latent variables, such as DBNs (Hinton et al., 2006; Lee et al., 2009) and DBMs (Salakhutdinov & Hinton, 2009), the greedy layer-wise learning algorithm (Hinton et al., 2006; Bengio et al., 2006) is applicable by exploiting their hierarchical structures. Some recent methods (Kuleshov & Ermon, 2017; Li et al., 2020) attempt to learn general EBLVMs by variational inference. However, such methods suffer from high variance (Kuleshov & Ermon, 2017) or high bias (Li et al., 2020) of the variational bounds for the partition function in high-dimensional spaces.

Learning EBLVMs without partition functions. As mentioned in Sec. 1, score matching (SM) is a popular class of learning methods that eliminate the partition function by considering the score function, and the recent bi-level score matching (BiSM) is a scalable extension of SM to learning general EBLVMs. Prior to BiSM, extensions of SM (Swersky et al., 2011; Vértes et al., 2016) impose structural assumptions on the model, e.g., a tractable posterior (Swersky et al., 2011) or a jointly exponential family model (Vértes et al., 2016). Recently, Song & Ermon (2019; 2020) use a score network to directly model the score function of the data distribution; incorporating latent variables into such a model is currently an open problem and we leave it as future work. Noise contrastive estimation (NCE) (Gutmann & Hyvärinen, 2010) is also a learning method that eliminates the partition function, and variational noise contrastive estimation (VNCE) (Rhodes & Gutmann, 2019) is an extension of NCE to general EBLVMs based on variational methods. However, such methods have to manually design a noise distribution, which in principle should be close to the data distribution and is challenging in high-dimensional spaces. There are alternative approaches to handling the partition function, e.g., Doubly Dual Embedding (Dai et al., 2019a), Adversarial Dynamics Embedding (Dai et al., 2019b) and Fenchel Mini-Max Learning (Tao et al., 2019), and extending these methods to learning EBLVMs would be interesting future work.

Evaluating EBLVMs. The log-likelihood is commonly used to compare fully visible models on the same dataset, and annealed importance sampling (Neal, 2001; Salakhutdinov, 2008) is often used to estimate the partition function in the log-likelihood. For a general EBLVM, a lower bound of the log-likelihood (Kingma & Welling, 2014; Burda et al., 2016) can be estimated by introducing a variational posterior together with estimating the partition function. The log-likelihood is a relative value, and the corresponding absolute value (i.e., the KL divergence) is generally intractable since the entropy of the data distribution is unknown.

7. Conclusion

We propose new variational estimates of the score function and its gradient with respect to the model parameters in general energy-based latent variable models (EBLVMs). We rewrite the score function and its gradient as combinations of expectation and covariance terms over the model posterior. We approximate the model posterior with a variational posterior and analyze the resulting bias. With samples drawn from the variational posterior, the expectation terms are estimated by Monte Carlo and the covariance terms are estimated by the sample covariance matrix. We apply our estimates to kernelized Stein discrepancy (KSD), score matching (SM)-based methods and the exact Fisher divergence. In particular, these estimates applied to SM-based methods are more time and memory efficient than the complex gradient unrolling in the strongest baseline, bi-level score matching (BiSM), and meanwhile can scale up to natural images in deep EBLVMs. Besides, we also present interpolation results of annealed Langevin dynamics trajectories in the latent space of deep EBLVMs.

Software and Data

Code is available at https://github.com/baofff/VaGES.

Acknowledgements

This work was supported by NSFC Projects (Nos. 61620106010, 62061136001, 61621136008, U1811461, 62076145), Beijing NSF Project (No. JQ19016), Beijing Academy of Artificial Intelligence (BAAI), Tsinghua-Huawei Joint Research Program, a grant from Tsinghua Institute for Guo Qiang, Tiangong Institute for Intelligent Computing, and the NVIDIA NVAIL Program with GPU/DGX Acceleration. C. Li was supported by the fellowship of the China Postdoctoral Science Foundation (2020M680572), the fellowship of the China national postdoctoral program for innovative talents (BX20190172), and the Shuimu Tsinghua Scholar program.

References

Bao, F., Li, C., Xu, T., Su, H., Zhu, J., and Zhang, B. Bi-level score matching for learning energy-based latent variable models. In Advances in Neural Information Processing Systems 33, 2020.

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems 19, pp. 153–160. MIT Press, 2006.

Burda, Y., Grosse, R. B., and Salakhutdinov, R. Importance weighted autoencoders. In 4th International Conference on Learning Representations, 2016.

Dai, B., Dai, H., Gretton, A., Song, L., Schuurmans, D., and He, N. Kernel exponential family estimation via doubly dual embedding. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, volume 89 of Proceedings of Machine Learning Research, pp. 2321–2330. PMLR, 2019a.

Dai, B., Liu, Z., Dai, H., He, N., Gretton, A., Song, L., and Schuurmans, D. Exponential family estimation via adversarial dynamics embedding. In Advances in Neural Information Processing Systems 32, pp. 10977–10988, 2019b.

Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. In 3rd International Conference on Learning Representations, 2015.

Du, C., Xu, K., Li, C., Zhu, J., and Zhang, B. Learning implicit generative models by teaching explicit ones. arXiv preprint arXiv:1807.03870, 2018.

Du, Y. and Mordatch, I. Implicit generation and modeling with energy based models. In Advances in Neural Information Processing Systems 32, pp. 3603–3613, 2019.

Fan, J., Liao, Y., and Liu, H. An overview of the estimation of large covariance and precision matrices, 2016.

Gao, R., Nijkamp, E., Kingma, D. P., Xu, Z., Dai, A. M., and Wu, Y. N. Flow contrastive estimation of energy-based models. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13–19, 2020, pp. 7515–7525. IEEE, 2020.

Geman, S. and Geman, D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.

Gong, W., Li, Y., and Hernández-Lobato, J. M. Sliced kernelized Stein discrepancy. arXiv preprint arXiv:2006.16531, 2020.

Gorham, J. and Mackey, L. W. Measuring sample quality with kernels. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1292–1301. PMLR, 2017.

Grathwohl, W., Wang, K., Jacobsen, J., Duvenaud, D., Norouzi, M., and Swersky, K. Your classifier is secretly an energy based model and you should treat it like one. In 8th International Conference on Learning Representations. OpenReview.net, 2020a.

Grathwohl, W., Wang, K., Jacobsen, J., Duvenaud, D., and Zemel, R. S. Learning the Stein discrepancy for training and evaluating energy-based models without sampling. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 3732–3747. PMLR, 2020b.

Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304, 2010.

Han, T., Nijkamp, E., Zhou, L., Pang, B., Zhu, S., and Wu, Y. N. Joint training of variational auto-encoder and latent energy-based model. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13–19, 2020, pp. 7975–7984. IEEE, 2020.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems 30, pp. 6626–6637, 2017.

Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

Hinton, G. E., Osindero, S., and Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

Hu, T., Chen, Z., Sun, H., Bai, J., Ye, M., and Cheng, G. Stein neural sampler. arXiv preprint arXiv:1810.03545, 2018.

Hyvärinen, A. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709, 2005.

Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-Softmax. In 5th International Conference on Learning Representations. OpenReview.net, 2017.

Johnson, O. T. Information theory and the central limit theorem. World Scientific, 2004.

Kingma, D. P. and LeCun, Y. Regularized estimation of image statistics by score matching. In Advances in Neural Information Processing Systems 23, pp. 1126–1134. Curran Associates, Inc., 2010.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, 2014.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Kuleshov, V. and Ermon, S. Neural variational inference and learning in undirected graphical models. In Advances in Neural Information Processing Systems 30, pp. 6734–6743, 2017.

LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., and Huang, F. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.

LeCun, Y., Cortes, C., and Burges, C. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.

Lee, H., Grosse, R. B., Ranganath, R., and Ng, A. Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, volume 382 of ACM International Conference Proceeding Series, pp. 609–616. ACM, 2009.

Ley, C., Swan, Y., et al. Stein's density approach and information inequalities. Electronic Communications in Probability, 18, 2013.

Li, C., Du, C., Xu, K., Welling, M., Zhu, J., and Zhang, B. To relieve your headache of training an MRF, take AdVIL. In 8th International Conference on Learning Representations. OpenReview.net, 2020.

Li, Z., Chen, Y., and Sommer, F. T. Annealed denoising score matching: Learning energy-based models in high-dimensional spaces. arXiv preprint arXiv:1910.07762, 2019.

Liu, Q., Lee, J. D., and Jordan, M. I. A kernelized Stein discrepancy for goodness-of-fit tests. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of JMLR Workshop and Conference Proceedings, pp. 276–284. JMLR.org, 2016.

Liu, W., Wang, X., Owens, J. D., and Li, Y. Energy-based out-of-distribution detection. In Advances in Neural Information Processing Systems 33, 2020.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7–13, 2015, pp. 3730–3738. IEEE Computer Society, 2015.

Metz, L., Poole, B., Pfau, D., and Sohl-Dickstein, J. Unrolled generative adversarial networks. In 5th International Conference on Learning Representations. OpenReview.net, 2017.

Neal, R. M. Annealed importance sampling. Statistics and Computing, 11(2):125–139, 2001.

Nijkamp, E., Hill, M., Zhu, S., and Wu, Y. N. Learning non-convergent non-persistent short-run MCMC toward energy-based model. In Advances in Neural Information Processing Systems 32, pp. 5233–5243, 2019.

Nijkamp, E., Hill, M., Han, T., Zhu, S., and Wu, Y. N. On the anatomy of MCMC-based maximum likelihood learning of energy-based models. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, pp. 5272–5280. AAAI Press, 2020.

Owen, A. B. Monte Carlo theory, methods and examples. 2013.

Pang, T., Xu, K., Li, C., Song, Y., Ermon, S., and Zhu, J. Efficient learning of generative models via finite-difference score matching. In Advances in Neural Information Processing Systems 33, 2020.

Qiu, Y., Zhang, L., and Wang, X. Unbiased contrastive divergence algorithm for training energy-based latent variable models. In 8th International Conference on Learning Representations. OpenReview.net, 2020.

Rhodes, B. and Gutmann, M. U. Variational noise-contrastive estimation. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, volume 89 of Proceedings of Machine Learning Research, pp. 2741–2750. PMLR, 2019.

Salakhutdinov, R. Learning and evaluating Boltzmann machines. UTML TR, 2:21, 2008.

Salakhutdinov, R. and Hinton, G. Deep Boltzmann machines. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, 2009.

Saremi, S., Mehrjou, A., Schölkopf, B., and Hyvärinen, A. Deep energy estimator networks. arXiv preprint arXiv:1805.08306, 2018.

Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems 32, pp. 11895–11907, 2019.

Song, Y. and Ermon, S. Improved techniques for training score-based generative models. In Advances in Neural Information Processing Systems 33, 2020.

Song, Y., Garg, S., Shi, J., and Ermon, S. Sliced score matching: A scalable approach to density and score estimation. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2019, volume 115 of Proceedings of Machine Learning Research, pp. 574–584. AUAI Press, 2019.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

Srivastava, N. and Salakhutdinov, R. Multimodal learning with deep Boltzmann machines. In Advances in Neural Information Processing Systems 25, pp. 2231–2239, 2012.

Swersky, K., Ranzato, M., Buchman, D., Marlin, B. M., and de Freitas, N. On autoencoders and score matching for energy based models. In Proceedings of the 28th International Conference on Machine Learning, pp. 1201–1208. Omnipress, 2011.

Tao, C., Chen, L., Dai, S., Chen, J., Bai, K., Wang, D., Feng, J., Lu, W., Bobashev, G. V., and Carin, L. On Fenchel mini-max learning. In Advances in Neural Information Processing Systems 32, pp. 10427–10439, 2019.

Tieleman, T. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), volume 307 of ACM International Conference Proceeding Series, pp. 1064–1071. ACM, 2008.

Vértes, E. and Sahani, M. Learning doubly intractable latent variable models via score matching. 2016.

Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.

Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, pp. 681–688. Omnipress, 2011.

Welling, M., Rosen-Zvi, M., and Hinton, G. E. Exponential family harmoniums with an application to information retrieval. In Advances in Neural Information Processing Systems 17, pp. 1481–1488, 2004.

Xie, J., Lu, Y., Gao, R., Zhu, S.-C., and Wu, Y. N. Cooperative training of descriptor and generator networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(1):27–45, 2018.