Sliced Score Matching: A Scalable Approach to Density and Score Estimation
Yang Song∗ (Stanford University), Sahaj Garg∗ (Stanford University), Jiaxin Shi (Tsinghua University), Stefano Ermon (Stanford University)

∗Joint first authors. Correspondence to Yang Song <[email protected]> and Stefano Ermon <[email protected]>.

arXiv:1905.07088v2 [cs.LG] 27 Jun 2019

Abstract

Score matching is a popular method for estimating unnormalized statistical models. However, it has so far been limited to simple, shallow models or low-dimensional data, due to the difficulty of computing the Hessian of log-density functions. We show this difficulty can be mitigated by projecting the scores onto random vectors before comparing them. This objective, called sliced score matching, only involves Hessian-vector products, which can be easily implemented using reverse-mode automatic differentiation. Therefore, sliced score matching is amenable to more complex models and higher dimensional data compared to score matching. Theoretically, we prove the consistency and asymptotic normality of sliced score matching estimators. Moreover, we demonstrate that sliced score matching can be used to learn deep score estimators for implicit distributions. In our experiments, we show sliced score matching can learn deep energy-based models effectively, and can produce accurate score estimates for applications such as variational inference with implicit distributions and training Wasserstein Auto-Encoders.

1 INTRODUCTION

Score matching (Hyvärinen, 2005) is particularly suitable for learning unnormalized statistical models, such as energy-based ones. It is based on minimizing the distance between the derivatives of the log-density functions (a.k.a. scores) of the data and model distributions. Unlike maximum likelihood estimation (MLE), the objective of score matching only depends on the scores, which are oblivious to the (usually) intractable partition functions. However, score matching requires computing the diagonal elements of the Hessian of the model's log-density function. This Hessian trace computation is generally expensive (Martens et al., 2012), requiring a number of forward and backward propagations proportional to the data dimension. This severely limits its applicability to complex models parameterized by deep neural networks, such as deep energy-based models (LeCun et al., 2006; Wenliang et al., 2019).

Several approaches have been proposed to alleviate this difficulty: Kingma & LeCun (2010) propose approximate backpropagation for computing the trace of the Hessian; Martens et al. (2012) develop curvature propagation, a fast stochastic estimator of the trace in score matching; and Vincent (2011) transforms score matching into a denoising problem that avoids second-order derivatives. These methods have achieved some success, but may suffer from one or more of the following problems: inconsistent parameter estimation, large estimation variance, and cumbersome implementation.

To alleviate these problems, we propose sliced score matching, a variant of score matching that scales to deep unnormalized models and high-dimensional data. The key intuition is that instead of directly matching the high-dimensional scores, we match their projections along random directions. Theoretically, we show that under some regularity conditions, sliced score matching is a well-defined statistical estimation criterion that yields consistent and asymptotically normal parameter estimates. Moreover, compared to the methods of Kingma & LeCun (2010) and Martens et al. (2012), whose implementations require customized backpropagation for deep networks, sliced score matching only involves Hessian-vector products, and can therefore be easily and efficiently implemented in frameworks such as TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2017).
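To make the Hessian-vector product point concrete, a minimal sketch in PyTorch is shown below. This is illustrative only, not the authors' released code; log_p_tilde stands for a hypothetical unnormalized log-density implemented as a differentiable function of a batch x of shape [N, D].

import torch

def hvp(log_p_tilde, x, v):
    # Hessian-vector product (grad_x^2 log p~(x)) v via two reverse-mode passes.
    x = x.requires_grad_(True)
    # First pass: the score s(x) = grad_x log p~(x); keep the graph for a second pass.
    score = torch.autograd.grad(log_p_tilde(x).sum(), x, create_graph=True)[0]
    # Second pass: grad_x (v . s(x)) = (grad_x^2 log p~(x)) v, since the Hessian is symmetric.
    return torch.autograd.grad((score * v).sum(), x)[0]

No custom backpropagation rules are needed; both passes are ordinary reverse-mode autodiff calls.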
Beyond training unnormalized models, sliced score matching can also be naturally adapted as an objective for estimating the score function of a data generating distribution (Sasaki et al., 2014; Strathmann et al., 2015) by training a score function model parameterized by deep neural networks. This observation enables many new applications of sliced score matching. For example, we show that it can be used to provide the accurate score estimates needed for variational inference with implicit distributions (Huszár, 2017) and for learning Wasserstein Auto-Encoders (WAE, Tolstikhin et al. (2018)).

Finally, we evaluate the performance of sliced score matching on learning unnormalized statistical models (density estimation) and estimating score functions of a data generating process (score estimation). For density estimation, experiments on deep kernel exponential families (Wenliang et al., 2019) and NICE flow models (Dinh et al., 2015) show that our method is either more scalable or more accurate than existing score matching variants. For score estimation, our method improves the performance of variational auto-encoders (VAE) with implicit encoders, and can train WAEs without a discriminator or MMD loss by directly optimizing the KL divergence between aggregated posteriors and the prior. In both settings we outperform kernel-based score estimators (Li & Turner, 2018; Shi et al., 2018), achieving better test likelihoods and better sample quality in image generation.

2 BACKGROUND

Given i.i.d. samples {x_1, x_2, ..., x_N} ⊂ R^D from a data distribution p_d(x), our task is to learn an unnormalized density p̃_m(x; θ), where θ belongs to some parameter space Θ. The model's partition function is denoted Z_θ, which is assumed to exist but to be intractable. Let p_m(x; θ) be the normalized density determined by our model:

    p_m(x; θ) = p̃_m(x; θ) / Z_θ,    Z_θ = ∫ p̃_m(x; θ) dx.

For convenience, we denote the score functions of p_m and p_d as s_m(x; θ) ≜ ∇_x log p_m(x; θ) and s_d(x) ≜ ∇_x log p_d(x), respectively. Note that since log p_m(x; θ) = log p̃_m(x; θ) − log Z_θ, we immediately conclude that s_m(x; θ) does not depend on the intractable partition function Z_θ.

2.1 SCORE MATCHING

Learning unnormalized models with maximum likelihood estimation (MLE) can be difficult due to the intractable partition function Z_θ. To avoid this, score matching (Hyvärinen, 2005) minimizes the Fisher divergence between p_d and p_m(·; θ), which is defined as

    L(θ) ≜ (1/2) E_{p_d}[ ‖s_m(x; θ) − s_d(x)‖_2^2 ].    (1)

Since s_m(x; θ) does not involve Z_θ, the Fisher divergence does not depend on the intractable partition function. However, Eq. (1) is still not readily usable for learning unnormalized models, as we only have samples and do not have access to the score function of the data, s_d(x). By applying integration by parts, Hyvärinen (2005) shows that L(θ) can be written as L(θ) = J(θ) + C (cf. Theorem 1 in Hyvärinen (2005)), where

    J(θ) ≜ E_{p_d}[ tr(∇_x s_m(x; θ)) + (1/2) ‖s_m(x; θ)‖_2^2 ],    (2)

C is a constant that does not depend on θ, tr(·) denotes the trace of a matrix, and

    ∇_x s_m(x; θ) = ∇_x^2 log p̃_m(x; θ)    (3)

is the Hessian of the log-density function. The constant can be ignored, and the following unbiased estimator of the remaining terms is used to train p̃_m(x; θ):

    Ĵ(θ; x_1^N) ≜ (1/N) Σ_{i=1}^N [ tr(∇_x s_m(x_i; θ)) + (1/2) ‖s_m(x_i; θ)‖_2^2 ],

where x_1^N is a shorthand used throughout the paper for a collection of N data points {x_1, x_2, ..., x_N} sampled i.i.d. from p_d, and ∇_x s_m(x_i; θ) denotes the Hessian of log p̃_m(x; θ) evaluated at x_i.

Computational Difficulty. While the score matching objective Ĵ(θ; x_1^N) avoids the computation of Z_θ for unnormalized models, it introduces a new computational difficulty: computing the trace of the Hessian of a log-density function, ∇_x^2 log p̃_m. A naïve approach requires D times more backward passes than computing the gradient s_m = ∇_x log p̃_m (see Alg. 2 in the appendix): for example, the trace could be computed by applying backpropagation D times to s_m to obtain each diagonal term of ∇_x^2 log p̃_m sequentially. In practice, D can be many thousands, which can render score matching too slow for practical purposes. Moreover, Martens et al. (2012) argue from a theoretical perspective that it is unlikely that there exists an algorithm for computing the diagonal of the Hessian defined by an arbitrary computation graph within a constant number of forward and backward passes.
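For illustration, the estimator Ĵ(θ; x_1^N) could be written naively as follows. This is a hypothetical sketch (not Alg. 2 from the appendix), with log_p_tilde again denoting an assumed unnormalized log-density; the loop makes the cost explicit, one extra backward pass per input dimension.

import torch

def score_matching_loss_naive(log_p_tilde, x):
    # Monte Carlo estimate of J(theta) = E[ tr(grad_x s_m(x)) + 0.5 * ||s_m(x)||^2 ]
    # for a batch x of shape [N, D], computing the Hessian trace entry by entry.
    x = x.requires_grad_(True)
    score = torch.autograd.grad(log_p_tilde(x).sum(), x, create_graph=True)[0]  # s_m(x), shape [N, D]
    trace = torch.zeros_like(score[:, 0])
    for d in range(x.shape[1]):  # D extra backward passes: the bottleneck discussed above
        grad_d = torch.autograd.grad(score[:, d].sum(), x, create_graph=True)[0]
        trace = trace + grad_d[:, d]  # d-th diagonal entry of the Hessian, per sample
    return (trace + 0.5 * (score ** 2).sum(dim=1)).mean()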
2.2 SCORE ESTIMATION FOR IMPLICIT DISTRIBUTIONS

Besides parameter estimation in unnormalized models, score matching can also be used to estimate scores of implicit distributions, i.e., distributions that have a tractable sampling process but no tractable density. For example, the distribution of random samples from the generator of a GAN (Goodfellow et al., 2014) is an implicit distribution. Implicit distributions arise in many more situations, such as the marginal distribution of a non-conjugate model (Sun et al., 2019) and models defined by complex simulation processes (Tran et al., 2017). In many cases, learning and inference become intractable due to the need to optimize an objective that involves the intractable density. In these cases, directly estimating the score function s_q(x) = ∇_x log q_θ(x) based on i.i.d.

Here v ∼ p_v and x ∼ p_d are independent, and we require E_{p_v}[v v^⊤] ≻ 0 and E_{p_v}[‖v‖_2^2] < ∞. Many examples of p_v satisfy these requirements. For instance, p_v can be a multivariate standard normal (N(0, I_D)), a multivariate Rademacher distribution (the uniform distribution over {±1}^D), or a uniform distribution on the hypersphere S^(D−1) (recall that D refers to the dimension of x). To eliminate the dependence of L(θ; p_v) on s_d(x), we use integration by parts as in score matching (cf. the equivalence between Eq. (1) and (2)). Defining

    J(θ; p_v) ≜ E_{p_v} E_{p_d}[ v^⊤ ∇_x s_m(x; θ) v
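The projected quantity v^⊤ ∇_x s_m(x; θ) v can be estimated with a single Hessian-vector product per projection, regardless of the dimension D. Below is a minimal, hypothetical PyTorch sketch, not the authors' code, using standard normal projections and an assumed unnormalized log-density log_p_tilde.

import torch

def sliced_hessian_term(log_p_tilde, x):
    # One-projection Monte Carlo estimate of E_{p_v} E_{p_d}[ v^T (grad_x s_m(x; theta)) v ]
    # for a batch x of shape [N, D]; one Hessian-vector product replaces D backward passes.
    x = x.requires_grad_(True)
    v = torch.randn_like(x)  # v ~ N(0, I_D); Rademacher or hypersphere draws also meet the requirements above
    score = torch.autograd.grad(log_p_tilde(x).sum(), x, create_graph=True)[0]   # s_m(x; theta)
    hvp = torch.autograd.grad((score * v).sum(), x, create_graph=True)[0]        # (grad_x s_m(x; theta)) v
    return (v * hvp).sum(dim=1).mean()

Keeping create_graph=True lets the returned scalar be differentiated with respect to the parameters inside log_p_tilde, so a term of this form can be used directly inside a training objective.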