Nonparametric Score

Yuhao Zhou 1 Jiaxin Shi 1 Jun Zhu 1

Abstract clude parametric score matching (Hyvarinen¨ , 2005; Sasaki et al., 2014; Song et al., 2019), its denoising variants as Estimating the score, i.e., the of log den- autoencoders (Vincent, 2011), nonparametric score match- sity function, from a set of samples generated by ing (Sriperumbudur et al., 2017; Sutherland et al., 2018), an unknown distribution is a fundamental task and kernel score estimators based on Stein’s methods (Li in inference and learning of probabilistic mod- & Turner, 2018; Shi et al., 2018). They have been success- els that involve flexible yet intractable densities. fully applied to applications such as estimating of Kernel estimators based on Stein’s methods or mutual information for representation learning (Wen et al., score matching have shown promise, however 2020), score-based generative modeling (Song & Ermon, their theoretical properties and relationships have 2019; Saremi & Hyvarinen, 2019), gradient-free adaptive not been fully-understood. We provide a unifying MCMC (Strathmann et al., 2015), learning implicit mod- view of these estimators under the framework of els (Warde-Farley & Bengio, 2016), and solving intractabil- regularized nonparametric regression. It allows ity in approximate inference algorithms (Sun et al., 2019). us to analyse existing estimators and construct new ones with desirable properties by choosing Recently, nonparametric score estimators are growing in different hypothesis spaces and regularizers. A popularity, mainly because they are flexible, have well- unified convergence analysis is provided for such studied statistical properties, and perform well when sam- estimators. Finally, we propose score estimators ples are very limited. Despite a common goal, they have based on iterative regularization that enjoy com- different motivations and expositions. For example, the putational benefits from curl-free kernels and fast work Sriperumbudur et al.(2017) is motivated from the convergence. density estimation perspective and the richness of kernel ex- ponential families (Canu & Smola, 2006; Fukumizu, 2009), where the is obtained by score matching. Li & 1. Introduction Turner(2018) and Shi et al.(2018) are mainly motivated by Stein’s methods. The solution of Li & Turner(2018) Intractability of density functions has long been a central gives the score prediction at sample points by minimizing challenge in probabilistic learning. This may arise from the kernelized Stein discrepancy (Chwialkowski et al., 2016; various situations such as training implicit models like Liu et al., 2016) and at an out-of-sample point by adding it GANs (Goodfellow et al., 2014), or marginalizing over a to the training data, while the estimator of Shi et al.(2018) non-conjugate hierarchical model, e.g., evaluating the out- is obtained by a spectral analysis in function space. put density of stochastic neural networks (Sun et al., 2019). In these situations, inference and learning often require eval- As these estimators are studied in different contexts, uating such intractable densities or optimizing an objective their relationships and theoretical properties are not fully- understood. In this paper, we provide a unifying view of arXiv:2005.10099v2 [stat.ML] 30 Jun 2020 that involves them. them under the regularized nonparametric regression frame- Among various solutions, one important family of meth- work. 
This framework allows us to construct new estimators ods are based on score estimation, which rely on a key with desirable properties, and to justify the consistency and step of estimating the score, i.e., the derivative of the improve the convergence rate of existing estimators. It also log density ∇x log p(x) from a set of samples drawn from allows us to clarify the relationships between these estima- some unknown probability density p. These methods in- tors. We show that they differ only in hypothesis spaces and regularization schemes. 1Dept. of Comp. Sci. & Tech., BNRist Center, Institute for AI, Tsinghua-Bosch ML Center, Tsinghua University. Correspondence Our contributions are both theoretical and algorithmic: to: J. Zhu .

Proceedings of the 37 th International Conference on Machine • We provide a unifying perspective of nonparametric Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by score estimators. We show that the major distinction of the author(s). the KEF estimator (Sriperumbudur et al., 2017) from Nonparametric Score Estimators

the other two estimators lies in the use of curl-free 2.1. Vector-Valued Learning kernels, while Li & Turner(2018) and Shi et al.(2018) Supervised vector-valued learning amounts to learning a differ mostly in regularization schemes, with the for- vector-valued function f : X → Y from a training set mer additionally ignores a one-dimensional subspace z z = {(xm, ym)} , where X ⊆ d, Y ⊆ q. Here we in the hypothesis space. We provide a unified conver- m∈[M] R R assume the training data is sampled from an unknown distri- gence analysis under the framework. bution ρ(x, y), which can be decomposed into ρ(y|x)ρX (x). • We justify the consistency of the Stein gradient es- A criterion for evaluating such an estimator is the mean 2 timator (Li & Turner, 2018), although the originally squared error (MSE) E(f) := Eρ(x,y) kf(x) − yk2. It proposed out-of-sample extension is heuristic and ex- is well-known that the conditional expectation fρ(x) := pensive. We provide a natural and principled out-of- Eρ(y|x)[y] minimizes E. In practice, we minimize the empir- sample extension derived from our framework. For 1 PM m m 2 ical error Ez(f) := M m=1 kf(x ) − y k2 in a certain both approaches we provide explicit convergence rates. hypothesis space F. However, the minimization problem • From the convergence analysis we also obtain the ex- is typically ill-posed for large F. Hence, it is convenient to plicit rate for Shi et al.(2018), which can be shown to consider the regularized problem: improve the error bound of Shi et al.(2018). 2 fz,λ := arg min Ez(f) + λ kfkF , (1) • Our results suggest favoring curl-free estimators in f∈F

high dimensions. To address the scalability challenge, where k·kF is the norm in F. In the vector-valued case, it we propose iterative score estimators by adopting the is typical to consider a vector-valued RKHS HK associated ν-method (Engl et al., 1996) as the regularizer. We with a matrix-valued kernel K as the hypothesis space. Then PM m show that the structure of curl-free kernels can further the estimator is fz,λ = m=1 Kxm c , where Kxm denotes accelerate such algorithms. Inspired by a similar idea, m m 1 the function K(x , ·). c solves the linear system ( M K + we propose a conjugate gradient solver of KEF that is 1 i j 1 M λI)c = M y with Kij = K(x , x ), c = (c , ··· , c ), y = significantly faster than previous approximations. (y1, ··· , yM ).

For convenience, we define the sampling operator Sx : Notation We always assume ρ is a probability mea- Mq 1 M HK → R as Sx(f) := (f(x ), ··· , f(x )). Its ad- ∗ Mq sure with probability density function p(x) supported S : → H hS (f), i Mq = joint x R K that satisfies x c R on X ⊂ d, and L2(X , ρ; d) is the Hilbert space ∗ Mq ∗ 1 M R R hf, Sx (c)iHK , ∀f ∈ HK, c ∈ R is Sx (c , ··· , c ) = d M m 1 ∗ 1 ∗ of all square integrable functions f : X → R with P m m=1 Kx c . Since ( M Sx Sx + λI)fz,λ = M Sx Kc + 1 inner product hf, giL2(X ,ρ; d) = Ex∼ρ[hf(x), g(x)i d ]. ∗ ∗ R R λSx c = M Sx y, the estimator now can be written as −1 We denote by h·, ·iρ and k·kρ the inner product and 1 ∗  1 ∗ f ,λ = S S + λI S y. In fact, if we consider 2 d z M x x M x the norm in L (X , ρ; ), respectively. We denote k 2 R the data-free limit of (1): arg min E(f) + λ kfk , as a scalar-valued kernel, and K as a matrix-valued f∈HK HK d×d the minimizer is unique when λ > 0 and is given by kernel K : X × X → R satisfying the following −1 fλ := (LK +λI) LKfρ, where LK : HK → HK is the in- conditions: (1) K(x, x0) = K(x0, x)T for any x, x0 ∈ X ; R Pm T tegral operator defined as LKf := X Kxf(x)dρX (Smale (2) i,j=1 ci K(xi, xj)cj ≥ 0 for any {xi} ⊂ X and 1 ∗ d & Zhou, 2007). It turns out that M Sx Sx is an empirical es- {ci} ⊂ R . We denote a vector-valued reproducing kernel ˆ 1 PM m 1 ∗ timate of LK: LKf := m=1 Kxm f(x ) = Sx Sxf. Hilbert space (RKHS) associated to K by HK, which is the M M m ˆ 1 ∗ P d It can also be shown that LKfρ = Sx y. Hence, we can closure of i=1 K(xi, ·)ci : xi ∈ X , ci ∈ R , m ∈ N M ˆ −1 ˆ under the norm induced by the inner product write fz,λ = (LK + λI) LKfρ. T hK(xi, ·)ci, K(xj, ·)sji := ci K(xi, xj)sj. We define As we have mentioned, the role of regularization is to deal Kx := K(x, ·) and [M] := {1, ··· ,M} for M ∈ Z+. For ˆ s×t with the ill-posedness. Specifically, LK is not always in- A1, ··· , An ∈ R , we use (A1, ··· , An) to represent ns×t vertible as it has finite rank and HK is usually of infinite a block matrix A ∈ R with A(i−1)s+j,k being dimension. Many regularization methods are studied in the the (j, k)-th component of Ai, and we similarly define T T T context of solving inverse problems (Engl et al., 1996) and [A1, ··· , An] := (A1 , ··· , An) . statistical learning theory (Bauer et al., 2007). The regular- ization method we presented in (1) is the famous Tikhonov 2. Background regularization, which belongs to a class of regularization techniques called spectral regularization (Bauer et al., 2007). In this section, we briefly introduce the nonparametric re- Specifically, spectral regularization corresponds to a family gression method of learning vector-valued functions (Bal- of estimators defined as dassarre et al., 2012). We also review existing kernel-based g ˆ ˆ approaches to score estimation. fz,λ := gλ(LK)LKfρ, Nonparametric Score Estimators

+ ˆ where gλ : R → R is a regularizer such that gλ(LK) due to the large linear system of size Md × Md. Suther- approximates the inverse of LˆK. Note that LˆK can be de- land et al.(2018) proposed to use the Nystr om¨ method to P composed into σi hei, ·i ei, where (σi, ei) is a pair of accelerate KEF. Instead of minimizing the loss in the whole eigenvalue and eigenfunction, we can define gλ(LˆK) := RKHS, they minimized it in a low dimensional subspace. P gλ(σi) hei, ·i ei. The Tikhonov regularization corre- sponds to g (σ) = (λ + σ)−1. There are several different λ Stein Gradient Estimator The Stein gradient estimator regularizers. For example, the spectral cut-off regularizer proposed by Li & Turner(2018) is based on inverting the is defined by g (σ) = σ−1 for σ ≥ λ and g (σ) = 0 other- λ λ following generalized Stein’s identity (Stein, 1981; Gorham wise. We refer the readers to Smale & Zhou(2007); Bauer & Mackey, 2015) et al.(2007); Baldassarre et al.(2012) for more details.

T 2.2. Related Work Ep[h(x)∇ log p(x) + ∇h(x)] = 0, (4) We assume log p(x) is differentiable, and define the score where h : X → d is a test function satisfying some regu- as s := ∇ log p. By score estimation we aim to estimate R p larity conditions. An empirical approximation of the iden- s from a set of i.i.d. samples {xm} drawn from ρ. p m∈[M] tity is − 1 HS ≈ ∇ h, where H = (h(x1), ··· , h(xM )) ∈ There have been many kernel-based score estimators studied M x Rd×M , S = (∇ log p(x1), ··· , ∇ log p(xM )) ∈ RM×d and in different contexts (Sriperumbudur et al., 2017; Sutherland 1 PM m ∇ h = ∇ m h(x ). Li & Turner(2018) pro- et al., 2018; Li & Turner, 2018; Shi et al., 2018). Below we x M m=1 x 1 2 η 2 give a brief review of them. posed to minimize ∇xh + M HS F + M 2 kSkF to esti- mate S, where k·kF denotes the Frobenius norm. The kernel i j i T j Kernel Exponential Family Estimator The kernel ex- trick k(x , x ) := h(x ) h(x ) was then exploited to obtain ponential family (KEF) (Canu & Smola, 2006; Fukumizu, the estimator. From the above we only have score estimates 2009) was originally proposed as an infinite-dimensional at the sample points. Li & Turner(2018) proposed a heuris- m generalization of exponential families. It was shown to be tic out-of-sample extension at x by adding it to {x } and useful in density estimation as it can approximate a broad recompute the minimizer. Such an approach is unjustified. class of densities arbitrarily well (Sriperumbudur et al., It is still unclear whether the estimator is consistent. 2017). The KEF is defined as: f(x)−A(f) A(f) Pk := {pf (x) = e : f ∈ Hk, e < ∞}, Spectral Stein Gradient Estimator The Spectral Stein Gradient Estimator (SSGE) (Shi et al., 2018) was derived by where Hk is a scalar-valued RKHS, and A(f) := a spectral analysis of the score function. Unlike Li & Turner log R ef(x)dx A(f) X is the normalizing constant. Since is (2018) it was shown to have convergence guarantees and typically intractable, Sriperumbudur et al.(2017) proposed principled out-of-sample extension. The idea is to expand to estimate f by matching the model score ∇ log pf and the each component of the score in a scalar-valued function data score sp, thus the KEF can naturally be used for score 2 P∞ space L (X , ρ): gi(x) = βijψj(x), where gi is the estimation (Strathmann et al., 2015). This approach works j=1 i-th component of the score and {ψj} are the eigenfunc- by minimizing the regularized score matching loss: R tions of the operator Lkf := X k(x, ·)f(x)dρ(x) min J(pkp ) + λ kfk2 , (2) associated with a scalar-valued kernel k. By using the f Hk f∈Hk Stein’s identity in (4) Shi et al.(2018) showed that 2 βij = − ρ[∂iψj(x)]. The Nystrom¨ method (Baker, 1977; where J(pkq) := Ep k∇ log p − ∇ log qk2 is the Fisher di- E vergence between p and q. Integration by parts was used to Williams & Seeger, 2001) was then used to estimate ψj: eliminated ∇ log p from J(pkq) (Hyvarinen¨ , 2005) and the √ M exact solution of (2) was given as follows (Sriperumbudur M X ψˆ (x) = k(x, xm)w , (5) et al., 2017, Theorem 5): j λ jm j m=1 M d X X ξˆ fˆ = c ∂ k(xm, ·) − , (3) p,λ (m−1)d+j j λ where {xm} are i.i.d. samples drawn from ρ, w is m=1 j=1 m∈[M] jm the m-th component of the eigenvector that corresponds to ˆ 1 PM Pd 2 m Md where ξ(x) = M m=1 j=1 ∂j k(x , ·), c ∈ R its j-th largest eigenvalue of the kernel matrix constructed m is obtained by solving (G + MλI)c = b/λ with from {x }m∈[M]. 
The final estimator was obtained by m ` J G(m−1)d+i,(`−1)d+j = ∂i∂j+dk(x , x ) and b(m−1)d+i = P ˆ truncating gi to j=1 βijψj(x) and plugging in ψj. Shi 1 PM Pd 2 m ` M `=1 j=1 ∂i∂j+dk(x , x ), where ∂j+d denotes tak- et al.(2018, Theorem 2) provided an error bound of SSGE ing derivative w.r.t. the j-th component of the second param- depending on J and M. However, the convergence rate is eter x`. This solution suffers from computational drawbacks still unknown. Nonparametric Score Estimators

g 3. Nonparametric Score Estimators Theorem 3.1 (Tikhonov Regularization). Let sˆp,λ be de- fined as in (8), and g (σ) = (σ + λ)−1. Then The kernel score estimators discussed in Sec. 2.2 were pro- λ posed in different contexts. The KEF estimator is motivated g ˆ sˆ (x) = KxXc − ζ(x)/λ, (9) from the density estimation perspective, while Stein and p,λ SSGE have no explicit density models. SSGE relies on where c is obtained by solving spectral analysis in the function space, while the other two are derived by minimizing a loss function. Despite sharing a (K + MλI)c = h/λ. (10) common goal, it is still unclear how these estimators relate Here c ∈ Md, h = (ζˆ(x1), ··· , ζˆ(xM )) ∈ Md, K = to each other. In this section, we present a unifying frame- R R xX [K(x, x1), ··· , K(x, xM )] ∈ d×Md, and K ∈ Md×Md is work of score estimation using regularized vector-valued R R given by K = K(xm, x`) . regression. We show that several existing kernel score esti- (m−1)d+i,(`−1)d+j ij mators are special cases under the framework, which allows The proof is given in appendix C.4, where the general repre- us to thoroughly investigate their strengths and weaknesses. senter theorem (Sriperumbudur et al., 2017, Theorem A.2) is used to show that the solution lies in the subspace generated 3.1. A Unifying Framework by d ˆ {Kxm cm : m ∈ [M], cm ∈ R } ∪ {ζ}. (11) As introduced in Sec. 2.2, the goal is to estimate the score sp from a set of i.i.d. samples {xm} drawn from ρ. We m∈[M] Unlike the Tikhonov regularizer that shifts all eigenval- first consider the ideal case where we have the ground truth ues simultaneously, the spectral cut-off regularization sets values of s at the sample locations. Then we can estimate p g (σ) = σ−1 for σ ≥ λ, and g (σ) = 0 otherwise. To s with vector-valued regression as described in Sec. 2.1: λ λ p obtain such estimator, we need the following lemma that M relates the spectral properties of K and LˆK. 1 X λ sˆ = arg min ks( m) − s ( m)k2 + ksk2 . 1 p,λ x p x 2 HK Lemma 3.2. Let σ be a non-zero eigenvalue of K such s∈H M 2 M K m=1 that 1 Ku = σu, where u ∈ Md is the unit eigenvector. (6) M R −1 Then σ is an eigenvalue of LˆK and the corresponding unit The solution is given by sˆp,λ = (LˆK + λI) LˆKsp. We could replace the Tikhonov regularizer with other spectral eigenfunction is regularization, for which the general solution is M 1 X (m) v = √ Kxm u , sˆg := g (Lˆ )Lˆ s . p,λ λ K K p (7) Mσ m=1

1:M (1) (M) (i) d In reality, the values of sp at x are unknown and we can- where u is splitted into (u , ··· , u ) and u ∈ R . ˆ 1 PM m not compute LKsp as M m=1 Kxm sp(x ). Fortunately, we could exploit integration by parts to avoid this problem. The lemma is a direct generalization of Rosasco et al.(2010, Under some mild regularity conditions (Assumptions B.1- Proposition 9) to vector-valued operators. g B.3), we have Theorem 3.3 (Spectral Cut-Off Regularization). Let sˆp,λ be defined as in (8), and L s = [K ∇ log p(x)] = − [div KT], K p Eρ x Eρ x x ( σ−1 σ > λ, T where the divergence of K is defined as a vector-valued gλ(σ) = x 0 σ ≤ λ. function, whose i-th component is the divergence of the T i-th column of K . The empirical estimate LˆKsp is then x Let (σj, uj)j≥1 be the eigenvalue and eigenvector pairs that 1 PM T available as − divxm K m , which leads to the fol- 1 M m=1 x satisfy M Kuj = σjuj. Then we have lowing general formula of nonparametric score estimators:   T g ˆ ujuj sˆ = −gλ(LˆK)ζ, (8) g X p,λ sˆp,λ(x) = −KxX  2  h, (12) Mσj σj ≥λ ˆ 1 PM T where ζ := M m=1 divxm Kxm . where KxX and h are defined as in Theorem 3.1. 3.2. Regularization Schemes Apart from the above methods with closed-form solutions, We now derive the final form of the estimator under three early stopping of iterative solvers like gradient descent can regularization schemes (Bauer et al., 2007). The choice also play the role of regularization (Engl et al., 1996). Iter- of regularization will impact the convergence rate of the ative methods replace the expensive inversion or eigende- estimator, which will be studied in Sec.4. composition of the Md × Md size kernel matrix with fast Nonparametric Score Estimators matrix-vector multiplication. In Sec. 3.5 we show that such eigendecomposition is the same as in the scalar-valued case. methods can be further accelerated by utilizing the structure On the other hand, the independence assumption may not of our kernel matrix. hold for score functions, whose output dimensions are cor- related as they form the gradient of the log density. As we We consider two iterative methods: the Landweber itera- shall see in Sec.5, such misspecification of the hypothesis tion and the ν-method (Engl et al., 1996). The Landweber ˆ space degrades the performance in high dimensions. iteration solves LˆKsp = −ζ with the fixed-point iteration: Curl-Free Kernels Noticing that score vector fields are (t+1) (t) ˆ ˆ (t) sˆp :=s ˆp − η ζ + LKsˆp , (13) gradient fields, we can use curl-free kernels (Fuselier Jr, 2007; Macedoˆ & Castro, 2010) to capture this property. where η is a step-size parameter. It can be regarded as using A curl-free kernel can be constructed from the negative the following regularization: Hessian of a translation-invariant kernel k(x, y) = φ(x − y): (k) 2 2 g Kcf (x, y) := −∇ φ(x − y), where φ : X → ∈ C . It Theorem 3.4 (Landweber Iteration). Let sˆp,λ and sˆp be R is easy to see that K is positive definite. A nice property defined as in (8) and (13), respectively. Let sˆ(0) = 0 and cf Pt−1 i −1 of HKcf is that any element in it is a gradient of some gλ(σ) = η i=0(1 − ησ) , where t := bλ c. Then, function. To see this, notice that the j-th column of Kcf is g (t) ˆ −∇(∂jφ) and each element in HKcf is a linear combination sˆp,λ =s ˆp = −tηζ + KxXct, of columns of Kcf . We also note that the unnormalized log- 2 where c0 = 0, ct+1 = (Id − ηK/M)ct − tη h/M, and density function can be recovered from the estimated score K, KxX, h are defined as in Theorem 3.1. 
when using curl-free kernels (see appendix C.3). The cost of inversion and eigendecomposition of the kernel matrix is The Landweber iteration often requires a large number of O(M 3d3), compared to O(M 3) for diagonal kernels. iterations. An accelerated version of it is the ν-method, where ν is a parameter controlling the maximal convergence 3.4. Examples rate (see Sec.4). The regularizer of the ν-method can be In the following we provide examples of nonparametric represented by a family of polynomials gλ(σ) = poly(σ). These polynomials approximate 1/σ better than those in score estimators derived from the framework. We show that the Landweber iteration. As a result, the ν-method only existing estimators can be recovered with certain types of −1/2 kernels and regularization schemes (Table1). requires a polynomial of degree bλ c to define gλ, which significantly reduces the number of iterations (Engl et al., Example 3.5 (KEF). Consider using curl-free kernels for 1996; Bauer et al., 2007). The next iterate of the ν-method the Tikhonov regularized estimator in (9). By substituting 2 can be generated by the current and the previous ones: Kcf (x, y) = −∇ φ(x − y) for K, we get (t+1) (t) (t) (t−1) ˆ ˆ (t) sˆp =s ˆp + ut(ˆsp − sˆp ) − ωt(ζ + LKsˆp ), M d ˆ X X ζcf (x) sˆg (x) = − c ∇∂ φ(x − xm) − , p,λ (m−1)d+j j λ where ut, ωt are carefully chosen constants (Engl et al., m=1 j=1 1996, Algorithm 6.13). We describe the full algorithm in Example C.4 (appendix C.4.3). 1 PM Pd 2 m where ζcf (x)i := − M m=1 j=1 ∂i∂j φ(x−x ). Notic- ing that Kcf (x, y)ij = −∂i∂jφ(x − y) = ∂i∂j+dk(x, y), we 3.3. Hypothesis Spaces could check that the definition of c here, which follows from (9), is the same as in (3). Thus by comparing with (3), In this framework, the hypothesis space is characterized by we have the matrix-valued kernel that induces the RKHS (Alvarez et al., 2012). Below we discuss two choices of the kernel: g ˆ sˆ (x) = ∇fp,λ(x) = ∇ log p ˆ (x). (14) the diagonal ones are computationally more efficient, while p,λ fp,λ curl-free kernels capture the conservative property of score Therefore, the KEF estimator is equivalent to choosing curl- vector fields. free kernels and the Tikhonov regularization in (8). Diagonal Kernels The simplest way to define a diago- We note that, although the solutions are equivalent, the nal matrix-valued kernel is K(x, y) = k(x, y)Id, where space {∇ log p , f ∈ H } looks different from the curl-free k : X × X → R is a scalar-valued kernel. This induces f k d d RKHS constructed from the negative Hessian of k. Such a product RKHS Hk := ⊗i=1Hk where all output dimen- sions of a function are independent. In this case the kernel equivalence of regularized minimization problems may be 1 M of independent interest. matrix for X = (x ,..., x ) is K = k(X, X) ⊗ Id, where k(X, X) denotes the Gram matrix of the scalar-valued kernel Example 3.6 (SSGE). For the estimator (12) obtained k. Therefore, the computational cost of matrix inversion and from the spectral cut-off regularization. Consider letting Nonparametric Score Estimators

K(x, y) = k(x, y)Id. Then K = k(X, X) ⊗ Id, and it can be Table 1. Existing nonparametric score estimators, their kernel PM Pd T T types, and regularization schemes. φ is from k(x, y) = φ(x − y). decomposed as m=1 i=1 λm(wmwm ⊗ eiei ), where {(λm, wm)} is the eigenpairs of k(X, X) with λ1 ≥ λ2 ≥ d ALGORITHM KERNEL REGULARIZER · · · ≥ λM and {ei} is the standard basis of . The estima- R −1 SSGE k(x, y)Id 1{σ≥λ}σ tor reduces to −1 Stein k(x, y)Id 1{σ>0}(λ + σ)   2 −1 T KEF −∇ φ(x − y)(λ + σ) wjw 2 −1 g X j NKEF −∇ φ(x − y) 1{σ>0}(λ + σ) sˆp,λ(x)i = −k(x, X)  2  ri, (15) λj λj ≥λ Interestingly, we get further acceleration by utilizing the 1 where M ri := (hi, hd+i, ··· , h(M−1)d+i). When we structure of curl-free kernels. choose λ = λJ , simple calculations (see appendix C.2) g Example 3.8 (Iterative curl-free estimators). We observe show that (15) equals the SSGE estimator sˆp,λ(x)i = that when using a curl-free kernel Kcf constructed from a 1 PJ PM ˆ m ˆ ˆ − M j=1 m=1 ∂iψj(x )ψj(x), where ψj is defined as radial scalar-valued kernel k(x, y) = φ(kx − yk), in (5). Therefore, SSGE is equivalent to choosing the di-  0 00  0 agonal kernel K( , ) = k( , ) and the spectral cut-off φ φ T φ x y x y Id Kcf (x, y) = − rr − I, regularization in (8). r3 r2 r

Example 3.7 (Stein). We consider modifying the Tikhonov where r = x − y, r = krk2. Consider in matrix-vector −1 d regularizer to gλ(σ) = (λ + σ) 1{σ>0}. In this case, we multiplications, for a vector a ∈ R , Kcf (x, y)a can be g −1 1 −1  φ0 φ00  T φ0 obtain an estimator sˆp,λ(x) = −KxXK ( M K + λI) h computed as r3 − r2 (r a)r− r a, where only a vector- by Lemma C.2. At sample points, the estimated score is vector multiplication is required with time complexity O(d), 1 −1 2 2 −( M K + λI) h, which coincides with the Stein gradient compared to general O(d ). Thus, we only need O(M d) estimator. This suggests a principled out-of-sample exten- Md time to compute Kb for any b ∈ R . In practice, we sion of the Stein gradient estimator. only need to store samples for computing Kb instead of To gain more insights, we consider to minimize (6) in the constructing the whole kernel matrix. This reduces the d memory usage from O(M 2d2) to O(M 2d). subspace generated by {Kxm cm : m ∈ [M], cm ∈ R }. Compared with (11), the one-dimensional subspace ζˆ is R We note that the same idea in Example 3.8 can be used ignored. We could check that (in appendix C.2) this is to accelerate the KEF estimator if we adopt the conjugate equivalent to exploiting the previous mentioned regular- gradient methods (Van Loan & Golub, 1983) to solve (10), izer. Therefore, the Stein estimator is equivalent to using because we have shown that the KEF estimator is equiva- the diagonal kernel K(x, y) = k(x, y)I and the Tikhonov d lent to our Tikhonov regularized estimators with curl-free regularization with a one-dimensional subspace ignored. kernels. As we shall see in experiments, this method is All the above examples can be extended to use a subset of extremely fast in high dimensions. the samples with Nystrom¨ methods (Williams & Seeger, 2001). Specifically, we can modify the general formula in 4. Theoretical Properties g,Z ˆ ˆ (8) as sˆp,λ = −gλ(PZLKPZ)PZζ, where PZ : HK → HK is the projection onto a low-dimensional subspace gener- In this section, we give a general theorem on the conver- ated by the subset Z. When the curl-free kernel and the gence rate of score estimators in our framework, which same truncated Tikhonov regularizer as in Example 3.7 provides a tighter error bound of SSGE (Shi et al., 2018). are used, this estimator is equivalent to the Nystrom¨ KEF We also investigate the case where samples are corrupted by (NKEF) (Sutherland et al., 2018). More details can be found a small set of points, and provide the convergence rate of the in appendix C.1. heuristic out-of-sample extension proposed in Li & Turner (2018). Proofs and assumptions are deferred to appendixB. 3.5. Scalability First, we follow Bauer et al.(2007); Baldassarre et al.(2012) When using curl-free kernels, we need to deal with an to characterize the regularizer. Md × Md matrix. In such cases, the Tikhonov and the Definition 4.1 (Bauer et al.(2007)) . We say a family 3 3 2 2 spectral cut-off regularization cost O(M d ) and have dif- of functions gλ : [0, κ ] → R, 0 < λ ≤ κ is ficulties scaling with the sample size and the input dimen- a regularizer if there are constants B, D, γ such that sions. Fortunately, as the unifying perspective suggests, sup0<σ≤κ2 |σgλ(σ)| ≤ D, sup0<σ≤κ2 |gλ(σ)| ≤ B/λ and we could modify the regularization schemes with iterative sup0<σ≤κ2 |1−σgλ(σ)| ≤ γ. 
The qualification of gλ is the r r methods that only require matrix-vector multiplications, e.g., maximal r such that sup0<σ≤κ2 |1 − σgλ(σ)|σ ≤ γrλ , the Landweber iteration and the ν-method (see Sec. 3.2). where γr does not depend on λ. Nonparametric Score Estimators

Now, we can use the idea of Bauer et al.(2007, Theorem Next, we consider the case where estimators are not obtained 10) to obtain an error bound of our estimator. from i.i.d. samples. Specifically, we consider how the Theorem 4.2. Assume Assumptions B.1-B.5 hold. Let r¯ be convergence rate is affected when our data is the mixture of g a set of i.i.d. samples and a set of fixed points. the qualification of the regularizer gλ, and sˆp,λ be defined as r Theorem 4.6. Under the same assumption of Theorem 4.2, in (8). Suppose there exists f0 ∈ HK such that sp = LKf0 − 1 −1 for some r ∈ [0, r¯]. Then we have for λ = M 2r+2 , we define gλ(σ) := (λ + σ) , and choose Z := n m {z }n∈[N] ⊆ X . Let Y := {y }m∈[M] be a set of i.i.d. g r − 2r+2  samples drawn from ρ, and sˆp,λ,Z be defined as in (8) with ksˆp,λ − spkHK = Op M , X = Z ∪ Y. Suppose N = O(M α), then we have for − 1 and for r ∈ [0, r¯ − 1/2], we have λ = M 2r+2 , − r  α− r r+1/2 2r+2 r+1 g   sup ksˆp,λ,Z − spkHK = Op M + O(M ), − 2r+2 ksˆp,λ − spkρ = Op M , Z n where the supZ is taken over all {z }n∈[N] ⊆ X . where Op is the Big-O notation in probability. Proof Outline. Define T := 1 S∗S , where S is the Note the qualification impacts the maximal convergence Z N Z Z Z rate. As the qualification of Tikhonov regularization is 1, sampling operator. Let sˆp,λ be defined as in (8) with ˆ from the error bound, we observe the well-known saturation X = Y. We can write the estimator as sˆp,λ,Z := gλ(LK + ˆ N ˆ phenomenon of Tikhonov regularization (Engl et al., 1996), RZ)(LK + RZ)sp, where RZ := M+N (TZ − LK), and ˆ ˆ ˆ i.e., the convergence rate does not improve even if sp = bound ksˆp,λ,Z−sˆp,λk by k(gλ(LK+RZ)−gλ(LK))LKspk+ r ˆ LKf0 for r > 1. To alleviate this, we can choose the kgλ(LK + RZ)RZspk. It can be shown that the first term is regularizer with a larger qualification. For example, the O NM −1λ−2, and the second term is O NM −1λ−1. spectral cut-off regularization and the Landweber iteration Combining these with Theorem 4.2, we finish the proof. have qualification ∞, and the ν-method has qualification ν. This suggests that the ν-method is appealing as it has From Theorem 4.6, we see that the convergence rate is r a smaller iteration number than the Landweber iteration not affected when data is corrupted by at most O(M 2r+2 ) and a better maximal convergence rate than the Tikhonov points. Under the same notation of this theorem, the out-of- regularization. sample extension of the Stein estimator proposed in Li & Remark 4.3 (Stein). The consistency and convergence rate Turner(2018) can be written as sˆp,λ,x(x), which corrupts of the Stein estimator and its out-of-sample extension sug- the i.i.d. data by a single test point. Then we can obtain the gested in Example 3.7 follow from Theorem 4.2. The rate following bound for this estimator. −θ1  in k·kρ is Op M , where θ1 ∈ [1/4, 1/3]. The conver- Corollary 4.7. With the same assumptions and notations of gence rate of the original out-of-sample extension in Li & Theorem 4.6, we have

Turner(2018) will be given in Corollary 4.7. r − 2r+2 sup ksˆp,λ,x(x) − sp(x)k2 = Op(M ). Remark 4.4 (SSGE). From Theorem 4.2, the convergence x∈X −θ2 rate in k·kρ of SSGE is Op(M ), where θ2 ∈ [1/4, 1/2), which improves Shi et al.(2018, Theorem 2). To see this, 5. Experiments we assume the eigenvalues of LK are µ1 > µ2 > ··· and −β they decay as µJ = O(J ). The error bound provided by We evaluate our estimators on both synthetic and real data. Shi et al.(2018) is In Sec. 5.1, we consider a challenging grid distribution as described in the experiment of Sutherland et al.(2018) to  2  2 J test the accuracy of nonparametric score estimators in high ksˆp,λ − spkρ = Op 2 + µJ . µJ (µJ − µJ+1) M dimensions and out-of-sample points, In Sec. 5.2 we train Wasserstein autoencoders (WAE) with score estimation and 1 4(β+1) compare the accuracy and the efficiency of different estima- We can choose J = M to obtain ksˆp,λ − spkρ = 1 − β tors. We mainly compare the following score estimators : Op(M 8(β+1) ). The convergence rate is slower than −1/4 Op(M ), the worst case of Theorem 4.2. Existing nonparametric estimators: Stein (Li & Turner, 2018), SSGE (Shi et al., 2018), KEF (Sriperumbudur et al., Remark 4.5 (KEF). Compared with Theorem 7(ii) in Sripe- 2017), and its low rank approximation NKEF (Sutherland rumbudur et al.(2017), where they bound the Fisher diver- α et al., 2018), where α represents to use αM/10 samples. gence, which is the square of the L2-norm in our Theo- rem 4.2, we see that the two results are exactly the same. 1Code is available at https://github.com/miskcoo/ −θ3  The rate in this norm is Op M , where θ3 ∈ [1/4, 1/3]. kscore. Nonparametric Score Estimators

Stein

e Stein 0.8 e 0.35 Table 2. Negative log-likelihoods on MNIST datasets and per c c

n SSGE

n SSGE a a t KEF-CG t 0.30 KEF-CG s epoch time on 128 latent dimension. All models are timed on s i i

d 0.6 NKEF d NKEF 4 0.25 4 2 2 ℓ ℓ

GeForce GTX TITAN X GPU. ν-Method ν-Method d d 0.20 e 0.4 e z z i i l l a

a 0.15 m m LATENT DIM 8 32 64 128 TIME r r 0.2

o 0.10 o n n 0.05 STEIN 97.15 92.10 101.60 114.41 4.2S 50 100 150 200 50 100 150 200 S dimension dimension SSGE 97.24 92.24 101.92 114.57 9.2 KEF 97.07 90.93 91.58 92.40 201.1S (a) M = 128 (b) M = 512 NKEF2 97.71 92.29 92.82 94.14 36.4S

0 NKEF4 97.59 91.19 91.80 92.94 97.5S 10 Stein 101 Stein e e c c

n SSGE n SSGE NKEF8 97.23 90.86 92.39 92.49 301.2S a a

t KEF-CG t KEF-CG s s i i

d −1 NKEF d NKEF 10 4 4 2 2 0 KEF-CG 97.39 90.77 92.66 92.05 13.7S ℓ ℓ 10 ν-Method ν-Method d d

e e ν-METHOD 97.28 90.94 91.48 92.10 78.1S z z i i l l a 10−2 a m m r r −1 SSM 96.98 89.06 93.06 96.92 6.0S o o 10 n n

100 200 300 400 100 200 300 400 samples samples is lower than that of curl-free kernels. This suggests favoring (c) d = 16 (d) d = 128 the diagonal kernels in low dimensions. Possibly because 2 this dataset does not make the convergence rate saturate, Figure 1. Normalized distance E[ksp − sˆp,λk2]/d on grid data. In the first row, M is fixed and d varies. In the second row, d is fixed we find different regularization schemes produce similar and M varies. Shaded areas are three times the standard deviation. results. The iterative score estimator based on the ν-method is among the best and KEF-CG closely tracked them even 350 35 with large d and M. 300 30 n i 250 m

25 λ / x

a 200

20 m 5.2. Wasserstein Autoencoders λ 150 15

Time per Epoch (s) 100 10 Wasserstein autoencoder (WAE) (Tolstikhin et al., 2017) 200 400 600 800 1000 50 100 150 200 250 Latent Dim Latent Dim is a latent variable model p(z, x) with observed and latent variables x ∈ X and z ∈ Z, respectively. p(z, x) is defined (a) Computational Cost (b) λmax/λmin by a prior p(z) and a distribution of z conditioned on x, −5 Figure 2. (a) Computational costs of KEF-CG for λ = 10 on and can be written as p(z, x) = p(z)pθ(x|z). WAEs aim at MNIST; (b) The ratio of the maximum and the minimum eigenval- minimizing Wasserstein distance Wc(pX , pG) between the R ues of kernel matrices. data distribution pX (x) and pG(x) := p(z, x)dz, where c is a metric on X . Tolstikhin et al.(2017) showed that Parametric estimators: In the WAE experiment, we also when pθ(x|z) maps z to x deterministically by a function consider the sliced score matching (SSM) estimator (Song G : Z → X , it suffices to minimize EpX (x)Eqφ(z|x)[kx − et al., 2019), which is a parametric method and requires 2 G(z)k2] + λD(qφ(z), p(z)), where D is a divergence of amortized training. two distributions and qφ(z|x) is a parametric approximation of the posterior. When we choose D to be the KL diver- Proposed: The iterative curl-free estimator with the ν- R method, and the conjugate gradient version of the KEF gence, the entropy term of qφ(z) := qφ(z|x)pX (x)dx in estimator (KEF-CG), both described in Sec. 3.5. the loss function is intractable (Song et al., 2019). If z can be parameterized by fφ(x) with x ∼ pX , the gradient of the entropy can be estimated using score estimators as 5.1. Synthetic Distributions EpX (x)[∇z log qφ(z)∇φfφ(x)] (Shi et al., 2018; Song et al., We follow Sutherland et al.(2018, Sec. 5.1) to construct 2019). a d-dimensional grid distribution. It is the mixture of d We train WAEs on MNIST and CelebA and repeat each con- standard Gaussian distributions centered at d fixed vertices figuration 3 times. The average negative log-likelihoods for in the unit hypercube. We change d and M respectively to MNIST estimated by AIS (Neal, 2001) are reported in Ta- test the accuracy and the convergence of score estimators, ble2. The results for CelebA are reported in appendixA. We and use 1024 samples from the grid distribution to evaluate can see that the performance of these estimators is close in the `2. We report the result of 32 runs in Fig.1. low latent dimensions, and the parametric method is slightly We can see that the effect of hypothesis space is significant. better than nonparametric ones as we have continuously gen- The diagonal kernels used in SSGE and Stein degrade the erated samples. However, in high dimensions, estimators accuracy in high dimensions, while curl-free kernels provide based on curl-free kernels significantly outperform those better performance. In low dimensions, all estimators are based on diagonal kernels and parametric methods. This is comparable, and the computational cost of diagonal kernels probably due to guarantee that the estimates at all locations Nonparametric Score Estimators form a gradient field. References As discussed in Sec. 3.5, curl-free kernels are computation- Alvarez, M. A., Rosasco, L., Lawrence, N. D., et al. Kernels ally expensive. This is shown in Table2 by the running for vector-valued functions: A review. Foundations and time of the original KEF algorithm. By comparing the time Trends R in Machine Learning, 4(3):195–266, 2012. 
and performance of NKEFα with α = 2, 4, 8, we see that in order to get meaningful speed-up in high dimensions, Baker, C. T. The numerical treatment of integral equations. low-rank approximation methods have to sacrifice the per- 1977. formance, which are outperformed by the iterative curl-free Baldassarre, L., Rosasco, L., Barla, A., and Verri, A. Multi- estimators based on the ν-method. KEF-CG is the fastest output learning via spectral filtering. Machine learning, curl-free method in high dimensions while the performance 87(3):259–301, 2012. is comparable with the original KEF. Fig. 2a shows the training time of KEF-CG in different latent dimensions. Sur- Bauer, F., Pereverzev, S., and Rosasco, L. On regularization prisingly, the speed rapidly increases with increasing latent algorithms in learning theory. Journal of complexity, 23 dimension and then flattens out. The convergence rate of (1):52–72, 2007. conjugate gradient is determined by the condition number, Canu, S. and Smola, A. Kernel methods and the exponential which means the kernel matrix K becomes well-conditioned family. Neurocomputing, 69(7-9):714–720, 2006. in high dimensions (see Fig. 2b). We found with large d, SSGE required at least 97% eigen- Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel values to attain the reported likelihood. We also ran SSGE test of goodness of fit. In International Conference on with curl-free kernels and found only 13% eigenvalues are Machine Learning, pp. 2606–2615, 2016. required to attain a comparable result when d = 8. From De Vito, E., Rosasco, L., and Toigo, A. Learning sets with these observations, a possible reason why diagonal kernels separating kernels. Applied and Computational Harmonic degrade the performance in high dimensions is that the dis- Analysis, 37(2):185–217, 2014. tribution is complicated while the hypothesis set is simple, so the small number of eigenfunctions are insufficient to ap- Engl, H. W., Hanke, M., and Neubauer, A. Regularization proximate the target. This can also be observed from Fig.1, of inverse problems, volume 375. Springer Science & where the performance of diagonal kernels and curl-free Business Media, 1996. kernels are closer as M increases since more eigenfunctions Fukumizu, K. Exponential manifold by reproducing kernel are provided. hilbert spaces. Algebraic and Geometric mothods in , pp. 291–306, 2009. 6. Conclusion Fuselier Jr, E. J. Refined error estimates for matrix-valued Our contributions are two folds. Theoretically, we present a radial basis functions. PhD thesis, Texas A&M Univer- unifying view of nonparametric score estimators, and clarify sity, 2007. the relationships of existing estimators. Under this perspec- tive, we provide a unified convergence analysis of existing Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., estimators, which improves existing error bounds. Practi- Warde-Farley, D., Ozair, S., Courville, A., and Bengio, cally, we propose an iterative curl-free estimator with nice Y. Generative adversarial nets. In Advances in Neural theoretical properties and computational benefits, and de- Information Processing Systems, pp. 2672–2680, 2014. velop a fast conjugate gradient solver for the KEF estimator. Gorham, J. and Mackey, L. Measuring sample quality with stein’s method. In Advances in Neural Information Pro- Acknowledgements cessing Systems, pp. 226–234, 2015. We thank Ziyu Wang for reading an earlier version of this Hyvarinen,¨ A. 
Estimation of non-normalized statistical manuscript and providing valuable feedback. This work was models by score matching. Journal of Machine Learning supported by the National Key Research and Development Research, 6(Apr):695–709, 2005. Program of China (No. 2017YFA0700904), NSFC Projects (Nos. 61620106010, U19B2034, U1811461), Beijing NSF Li, Y. and Turner, R. E. Gradient estimators for implicit Project (No. L172037), Beijing Academy of Artificial Intel- models. In International Conference on Learning Repre- ligence (BAAI), Tsinghua-Huawei Joint Research Program, sentations, 2018. Tsinghua Institute for Guo Qiang, and the NVIDIA NVAIL Liu, Q., Lee, J., and Jordan, M. A kernelized stein discrep- Program with GPU/DGX Acceleration. JS was also sup- ancy for goodness-of-fit tests. In International Confer- ported by a Microsoft Research Asia Fellowship. ence on Machine Learning, pp. 276–284, 2016. Nonparametric Score Estimators

Macedo,ˆ I. and Castro, R. Learning divergence-free and Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, curl-free vector fields with matrix-valued kernels. IMPA, B. Wasserstein auto-encoders. arXiv preprint 2010. arXiv:1711.01558, 2017. Neal, R. M. Annealed importance sampling. Statistics and Van Loan, C. F. and Golub, G. H. Matrix computations. computing, 11(2):125–139, 2001. Johns Hopkins University Press, 1983. Rosasco, L., Belkin, M., and Vito, E. D. On learning with in- Vincent, P. A connection between score matching and de- tegral operators. Journal of Machine Learning Research, noising autoencoders. Neural computation, 23(7):1661– 11(Feb):905–934, 2010. 1674, 2011. Saremi, S. and Hyvarinen, A. Neural empirical bayes. Jour- Vito, E. D., Rosasco, L., Caponnetto, A., Giovannini, U. D., nal of Machine Learning Research, 20:1–23, 2019. and Odone, F. Learning from examples as an inverse Sasaki, H., Hyvarinen,¨ A., and Sugiyama, M. Clustering problem. Journal of Machine Learning Research, 6(May): via mode seeking by direct estimation of the gradient of 883–904, 2005. Joint European Conference on Machine a log-density. In Warde-Farley, D. and Bengio, Y. Improving generative Learning and Knowledge Discovery in Databases , pp. adversarial networks with denoising feature matching. 19–34. Springer, 2014. International Conference on Learning Representations, Shi, J., Sun, S., and Zhu, J. A spectral approach to gradient 2016. estimation for implicit distributions. In International Wen, L., Zhou, Y., He, L., Zhou, M., and Xu, Z. Mutual in- Conference on Machine Learning, pp. 4651–4660, 2018. formation gradient estimation for representation learning. Smale, S. and Zhou, D.-X. Learning theory estimates via In International Conference on Learning Representations, integral operators and their approximations. Constructive 2020. approximation, 26(2):153–172, 2007. Williams, C. K. and Seeger, M. Using the nystrom¨ method Song, Y. and Ermon, S. Generative modeling by estimating to speed up kernel machines. In Advances in Neural gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 682–688, 2001. Information Processing Systems, pp. 11895–11907, 2019. Song, Y., Garg, S., Shi, J., and Ermon, S. Sliced score match- ing: A scalable approach to density and score estimation. arXiv preprint arXiv:1905.07088, 2019. Sriperumbudur, B., Fukumizu, K., Gretton, A., Hyvarinen,¨ A., and Kumar, R. Density estimation in infinite dimen- sional exponential families. The Journal of Machine Learning Research, 18(1):1830–1888, 2017. Stein, C. M. Estimation of the mean of a multivariate . The annals of Statistics, pp. 1135–1151, 1981. Strathmann, H., Sejdinovic, D., Livingstone, S., Szabo, Z., and Gretton, A. Gradient-free hamiltonian monte carlo with efficient kernel exponential families. In Advances in Neural Information Processing Systems, pp. 955–963, 2015. Sun, S., Zhang, G., Shi, J., and Grosse, R. Functional variational Bayesian Neural Networks. In International Conference on Learning Representations, 2019. Sutherland, D., Strathmann, H., Arbel, M., and Gretton, A. Efficient and principled score estimation with nystrom¨ kernel exponential families. In International Confer- ence on Artificial Intelligence and Statistics, pp. 652–660, 2018. Nonparametric Score Estimators

In appendixA we provide additional details and further results of experiments. In appendixB, we list the assumptions we used, and prove the non-asymptotic verison of Theorem 4.2 and 4.6. In appendixC, we give the details of Sec.3, including deriving algorithms presented in Sec. 3.2, examples in Sec. 3.4 and a general formula for curl-free kernels. AppendixD includes some technical results used in proofs. Finally, We present samples drawn from trained WAEs in appendixE.

A. Experiment Details and Additional Results

2 2 −1/2 In experiments, we use the IMQ kernel k(x, y) := (1 + kx − yk2/σ ) and its curl-free version in corresponding kernel estimators. We use the median of the pairwise Euclidean distances between samples as the kernel bandwidth. The parameter ν of the ν-method is set to 1. The maximum iteration number of KEF-CG is 40 and the convergence tolerance of it is 10−4.

A.1. Grid Distributions We use αM eigenvalues in SSGE with α searched in {0.99, 0.97, 0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4}. We search the number of iteration steps of the ν-method in {20, 30, 40, 50, 60, 70, 80, 90, 100}. We search the regularization coefficient λ of Stein, NKEF, KEF-CG in {10−k : k = 0, 1, ··· , 8}. The experiments are repeated 32 times.

A.2. Wasserstein Autoencoders

2 We use the standard Gaussian distribution N (0,I) as the prior p(z), and N (µφ(x), σφ(x)) as the approximated posterior qφ(z|x), and Bernoulli(Gθ(z)) as the generator pθ(x|z). We use minibatch size 64. Models are optimized by the Adam optimizer with learning rate 10−4. Each configuration is repeated 3 times, and the mean and the standard deviation are reported in Table3 and Table4. All models are timed on GeForce GTX TITAN X GPU.

Table 3. Negative log-likelihoods on the MNIST dataset and per epoch time on 128 latent dimension.

LATENT DIM 8 32 64 128 TIME STEIN 97.15 ± 0.14 92.10 ± 0.07 101.60 ± 0.44 114.41 ± 0.25 4.2S SSGE 97.24 ± 0.07 92.24 ± 0.17 101.92 ± 0.08 114.57 ± 0.23 9.2S KEF 97.07 ± 0.03 90.93 ± 0.23 91.58 ± 0.03 92.40 ± 0.34 201.1S NKEF2 97.71 ± 0.24 92.29 ± 0.41 92.82 ± 0.18 94.14 ± 0.69 36.4S NKEF4 97.59 ± 0.15 91.19 ± 0.08 91.80 ± 0.12 92.94 ± 0.58 97.5S NKEF8 97.23 ± 0.06 90.86 ± 0.09 92.39 ± 1.32 92.49 ± 0.41 301.2S KEF-CG 97.39 ± 0.22 90.77 ± 0.12 92.66 ± 0.67 92.05 ± 0.06 13.7S ν-METHOD 97.28 ± 0.17 90.94 ± 0.02 91.48 ± 0.09 92.10 ± 0.06 78.1S SSM 96.98 ± 0.27 89.06 ± 0.01 93.06 ± 0.68 96.92 ± 0.08 6.0S

Table 4. Frechet´ Inception Distances on the CelebA dataset and per epoch time on 128 latent dimension.

LATENT DIM 8 32 64 128 TIME STEIN 73.85 ± 1.39 58.29 ± 0.46 57.54 ± 0.57 76.31 ± 1.33 164.4S SSGE 72.49 ± 1.09 58.01 ± 0.60 58.39 ± 1.00 76.85 ± 1.12 172.2S NKEF2 75.12 ± 1.55 53.92 ± 0.29 51.16 ± 0.30 55.17 ± 0.43 244.7S NKEF4 73.15 ± 0.77 54.54 ± 1.02 50.76 ± 0.19 53.70 ± 0.10 412.5S KEF-CG 72.92 ± 0.60 54.32 ± 0.31 50.44 ± 0.20 50.66 ± 0.89 166.2S ν-METHOD 72.02 ± 1.22 52.86 ± 0.20 50.16 ± 0.23 52.80 ± 0.43 220.9S SSM 69.72 ± 0.25 49.93 ± 0.74 72.68 ± 1.75 94.07 ± 3.57 163.3S

2 MNIST We parameterize µφ, σφ and Gθ(z) by fully-connected neural networks with two hidden layers, both of which con- sist of 256 units activated by ReLU. For SSM, the score is parameterized by a fully-connected neural network with two hidden layers consisting of 256 units activated by tanh. The regularization coefficients of Stein, KEF, NKEF, KEF-CG are searched in {10−k : k = 2, 3, ··· , 7} for the best log-likelihood, and the number of iteration steps of the ν-method are searched in {50, 70, ··· , 150}, and we use αM eigenvalues in SSGE with α searched in {0.99, 0.97, 0.95, 0.93, 0.91, 0.89, 0.87}. We run 1000 epoches and evaluate the model by AIS (Neal, 2001), where the parameters are the same as in SSGE. Specifically, Nonparametric Score Estimators we set the step size of HMC to 10−6, and the leapfrog step to 10. We use 5 chains and set the temperature to 103.

2 CelebA We parameterize µφ, Gθ by convolutional neural networks similar to Song et al.(2019). σφ is set to 1. For SSM, we use the same network as in MNIST to parameterize the score. The regularization coefficients of Stein, KEF, NKEF, KEF-CG are searched in {10−k : k = 2, 3, ··· , 7} for the best log-likelihood, and the number of iteration steps of the ν-method are searched in {20, 30, 40, 50, 60, 70}, and we use αM eigenvalues in SSGE with α searched in {0.99, 0.97, 0.95, 0.93, 0.91, 0.89, 0.87}. We run 100 epoches and evaluate the model using the Frechet´ Inception Distance (FID). As KEF and NKEF8 are slow, we do not compare them in this dataset. Results are reported in Table4.

B. Error Bounds

In the following, we suppress the dependence of HK on K for simplicity. We use k·kHS to denote the Hilbert-Schmidt norm of operators. The assumptions required in obtaining an error bound are listed below. Assumption B.1. X is a non-empty open subset of Rd, with piecewise C1 boundary. Assumption B.2. p, log p and each element of K are continuously differentiable. p and its total derivative Dp : X → Rd can both be continuously extended to X¯, where X¯ is the closure of X . Each element of K and its total derivative can be continuously extended to X¯ × X¯. p 1−d Assumption B.3. For all i, j ∈ [d], K(x, x)ijp(x) = 0 on ∂X , and |K(x, x)ij|p(x) = o(kxk2 ) as x → ∞, where ∂X := X\X¯ . T R Assumption B.4. Define an HK-valued random variable ξx := divx Kx , let ξ := X ξxdρ. There are two constants Σ,K, such that Z     2 kξx − ξkH kξx − ξkH Σ exp − − 1 dρ ≤ 2 . X K K 2K 2 Assumption B.5. There is a constant κ > 0 such that supx∈X tr K(x, x) ≤ κ . Assumptions B.1-B.3 are similar to those in Sriperumbudur et al.(2017). They guarantee the integration by parts is valid, so T we can obtain Eρ[Kx∇ log p] = −Eρ[divx Kx ]. Assumptions B.4 and B.5 come from Bauer et al.(2007), and are used in the concentration inequalities. Note that Assumption B.4 can be replaced by a stronger one that kξx − ξkH is uniformly bounded on X . We follows the idea of Bauer et al.(2007, Theorem 10) to prove Theorem 4.2. The non-asymptotic version is given as follows g Theorem B.1. Assume Assumptions B.1-B.5 hold. Let r¯ be the qualification of the regularizer gλ, and sˆp,λ be defined as in (8). Suppose there exists f ∈ H such that s = Lr f , for some r ∈ [0, r¯]. Then for any 0 < δ < 1, √ 0 K p K 0 2 2r+2 − 1 M ≥ (2 2κ log(4/δ)) r , choosing λ = M 2r+2 , the following inequalities hold with probability at least 1 − δ

− r 4 ksˆ − s k ≤ C M 2r+2 log , p,λ p H 1 δ and for r ∈ [0, r¯ − 1/2], we have − 2r+1 4 ksˆ − s k ≤ C M 4r+4 log , p,λ p ρ 2 δ √ √ 2 2 3 where C1 = 2B(K + Σ) + 2 2Bκ kspkH + (γr + κ γcr)kf0kH, and C2 = 2B(K + Σ)κ + 2 2Bκ kspkH + ((γr + 2 2 κ γ 1 cr) + c 1 (γr + κ γcr))kf0kH, and cr is a constant depending on r. Op is the Big-O notation in probability. 2 2

Proof. We consider the following decomposition ˆ ksˆp,λ − spkH ≤ kgλ(LˆK)(ζ − ζ)kH + kgλ(LˆK)LKsp − spkH ˆ ≤ kgλ(LˆK)(ζ − ζ)kH + kgλ(LˆK)(LK − LˆK)spkH + krλ(LˆK)spkH, where rλ(σ) := gλ(σ)σ − 1. By Definition 4.1, we have kgλ(LˆK)k ≤ B/λ. From Lemma D.3 and D.4, with probability at least 1 − δ, we have √ 2B(K + Σ) + 2 2Bκ2 ks k ˆ p H 4 kgλ(LˆK)(ζ − ζ)kH + kgλ(LˆK)(LK − LˆK)spkH ≤ √ log . λ M δ Nonparametric Score Estimators

ˆ r r ˆ By Definition 4.1, krλ(LK)LKk ≤ γrλ and krλ(LK)k ≤ γ, then

ˆ ˆ ˆr ˆ r ˆr krλ(LK)spkH ≤ krλ(LK)LKf0kH + krλ(LK)(LK − LK)f0kH r r ˆr ≤ γrλ kf0kH + γkLK − LKk kf0kH .

r ˆr ˆ r When r ∈ [0, 1], from Bauer et al.(2007√, Theorem 1), there exists a constant cr such that kLK − LKk ≤ crkLK − LKk . Then by Lemma D.4, and choose λ ≥ 2 2κ2M −1/2 log(4/δ), we have

√ r 2 ! r r 2 2κ 4 r kL − Lˆ k ≤ cr √ log ≤ crλ . K K M δ

Collecting the above results,   A1 r 4 ksˆp,λ − spkH ≤ √ + A2λ log , λ M δ − 1 where A1,A2 are constants which do not depend on λ and M. Then, we can choose λ = M 2r+2 to obtain the bound. √ r √ Combining with λ ≥ 2 2κ2M −1/2 log(4/δ), we require M 2r+2 ≥ 2 2κ2 log(4/δ). 0 r ˆr 0 ˆ r ˆr When√ r > 1, from Lemma D.5, there exists a constant cr such that kLK −LKkHS ≤ crkLK −LKkHS. Then kLK −LKkHS ≤ 0 2 −1/2 2 2crκ M log(4/δ), and a similar discussion can be applied to obtain the bound. √ Note that ksˆp,λ − spkρ = k LK(ˆsp,λ − sp)kH. Then we can apply the above discussion to obtain the bound for k·kρ.

Next, we give the non-asymptotic version of Theorem 4.6 as follows −1 n Theorem B.2. Under the same assumption of Theorem B.1, we define gλ(σ) := (λ+σ) , and choose Z := {z }n∈[N] ⊆ X . Let Y := {ym} be a set of i.i.d. samples drawn from ρ. Let sˆ be defined as in (8) with X = Z ∪ Y. Suppose m∈[M] √ p,λ,Z α 2 2r+2 − 1 N = M , then for any 0 < δ < 1, M ≥ (2 2κ log(4/δ)) r , choosing λ = M 2r+2 , the following inequalities hold with probability at least 1 − δ

− r 4 α− r sup ksˆp,λ,Z − spkH ≤ C1M 2r+2 log + C3M r+1 , Z δ

2 2 n where C3 := 2(κ + 1) kspkH, and the supZ is taken over all {z }n∈[N] ⊂ X . r In particular, when α = 2r+2 , we have

− r 4 sup ksˆp,λ,Z − spkH ≤ (C1 + C3)M 2r+2 log . Z δ

1 ∗ 1 N ˆ Proof. We define TZ := N SZSZ, where SZf := (f(z ), ··· , f(z )) is the sampling operator. Let LK := TY and sˆp,λ be ˆ ˆ N ˆ the estimator obtained from Y. Then we can write sˆp,λ,Z := gλ(LK + RZ)(LK + RZ)sp, where RZ := M+N (TZ − LK). We can bound the error as follows

ksˆp,λ,Z − spkH ≤ ksˆp,λ,Z − sˆp,λkH + ksˆp,λ − spkH ˆ ˆ ˆ ˆ ≤ k(gλ(LK + RZ) − gλ(LK))LKspkH + kgλ(LK + RZ)RZspkH + ksˆp,λ − spkH.

−1 The last term has been bounded by Theorem B.1, and we consider the first two terms. Since gλ(σ) = (λ+σ) is Lipschitz in ˆ ˆ 2 ˆ [0, ∞), from Lemma D.5, we have kgλ(LK +RZ)−gλ(LK)kHS ≤ kRZkHS/λ . Note kgλ(LK +RZ)RZkHS ≤ kRZkHS/λ, we obtain κ2 1  κ2 1  2κ2N ksˆ − sˆ k ≤ + kR k ks k ≤ + ks k p,λ,Z p,λ H λ2 λ Z HS p H λ2 λ M + N p H 2 2 2(κ + 1) N 2 2 α− r ≤ ks k = 2(κ + 1) M r+1 ks k . λ2M p H p H Combining with Theorem B.1, and noticing that the right hand does not depend on Z, we obtain the final bound. Nonparametric Score Estimators

Finally, we prove the error bound of the Stein estimator with its original out-of-sample extension.

Proof of Corollary 4.7. The Stein estimator at point x ∈ X can be written as

d X sˆp,λ,x(x) = hKxei, sˆp,λ,xiHei, i=1

d where {ei} is the standard basis of R . Note that

d X 2 sup ksˆp,λ,x(x) − sp(x)k2 ≤ sup |hKxei, sˆp,λ,x − spiH| ≤ κ sup ksˆp,λ,x − spkH. x∈X i=1 x∈X x∈X Then, the bound of Stein estimator immediately follows from Theorem 4.6.

C. Details in Section3 C.1. A General Version of Nystrom¨ KEF In this section, we briefly review the Nystrom¨ version of KEF (NKEF, Sutherland et al.(2018)) and give a more general version of it in our framework. One of the drawbacks of KEF, as we have mentioned before, is the high computational complexity. It requires to solve an Md × Md linear system, where M is the sample size and d is the dimension. Note that the solution of KEF in (3) lies in m ˆ the subspace generated by {∂ik(x , ·): i ∈ [d], m ∈ [M]} ∪ {ξ}. The Nystrom¨ version of KEF consider to minimize the n n loss (2) in a smaller subspace generated by {∂ik(z , ·): i ∈ [d], n ∈ [N]}, where N  M and {z } is a subset randomly sampled from {xm}. Sutherland et al.(2018) showed that it suffices to solve an Nd × Nd linear system, which reduces the computational complexity, while the convergence rate remains the same as that of KEF if N = Ω(M θ log M), where θ ∈ [1/3, 1/2].

In our framework, we can also consider finding the estimator in a smaller subspace. Let $\mathcal H_Z$ be the subspace generated by $\{z^n\}_{n\in[N]}$, i.e., $\mathcal H_Z:=\mathrm{span}\{K_{z^n}c:n\in[N],c\in\mathbb R^d\}$. Consider the following minimization problem, a modification of (6) in which the solution is sought in $\mathcal H_Z$:

$$\hat s_{p,\lambda}^Z=\arg\min_{s\in\mathcal H_Z}\frac1M\sum_{m=1}^M\|s(x^m)-s_p(x^m)\|_2^2+\lambda\|s\|_{\mathcal H_K}^2.\qquad(16)$$

The solution can be written as $\hat s_{p,\lambda}^Z=-(P_Z\hat L_KP_Z+\lambda I)^{-1}P_Z\hat\zeta$, where $\hat\zeta$ and $\hat L_K$ are defined as in Sec. 3.1 and $P_Z:\mathcal H_K\to\mathcal H_K$ is the projection operator onto $\mathcal H_Z$, defined as

$$P_Zf:=\arg\min_{g\in\mathcal H_Z}\|g-f\|_{\mathcal H}^2=S_Z^*(S_ZS_Z^*)^{-1}S_Zf,$$
where $S_Z$ and $S_Z^*$ are the sampling operator and its adjoint, respectively. This motivates us to define the Nyström version of our score estimators for general regularization schemes as follows:
$$\hat s_{p,\lambda}^{g,Z}:=-g_\lambda(P_Z\hat L_KP_Z)P_Z\hat\zeta.\qquad(17)$$
To obtain the matrix form of (17), we first introduce two operators:
$$L:=P_Z\hat L_KP_Z,\qquad\mathbf L:=K_{ZZ}^{-\frac12}K_{ZX}K_{XZ}K_{ZZ}^{-\frac12}.$$
We want to connect the spectral decompositions of $L$ and $\mathbf L$ as in Lemma 3.2. Suppose the spectral decomposition of $\mathbf L$ is $\sum_{i=1}^{Nd}\sigma_iu_iu_i^T$, where $\|u_i\|_{\mathbb R^{Nd}}=1$. Consider $v_i:=S_Z^*K_{ZZ}^{-\frac12}u_i$; we can verify that
$$\|v_i\|_{\mathcal H}^2=(K_{ZZ}^{-\frac12}u_i)^TK_{ZZ}K_{ZZ}^{-\frac12}u_i=u_i^Tu_i=1,\qquad Lv_i=S_Z^*K_{ZZ}^{-1}K_{ZX}K_{XZ}K_{ZZ}^{-\frac12}u_i=S_Z^*K_{ZZ}^{-\frac12}\mathbf Lu_i=\sigma_iv_i.$$
Thus, $L=\sum_{i=1}^{Nd}\sigma_i\langle v_i,\cdot\rangle_{\mathcal H}v_i$ is the spectral decomposition of $L$. The estimator can then be written as
$$\hat s_{p,\lambda}^{g,Z}=-S_Z^*K_{ZZ}^{-\frac12}\left(\sum_{i=1}^{Nd}g_\lambda(\sigma_i)u_iu_i^T\right)K_{ZZ}^{-\frac12}h=-S_Z^*K_{ZZ}^{-\frac12}g_\lambda(\mathbf L)K_{ZZ}^{-\frac12}h.\qquad(18)$$
The above estimator only involves smaller matrices. However, it requires some expensive matrix manipulations, such as the matrix square root, for general regularization schemes. Fortunately, these expensive terms cancel when using Tikhonov regularization:

Example C.1. When we use Tikhonov regularization $g_\lambda(\sigma)=(\sigma+\lambda)^{-1}$ and curl-free kernels, the score estimator (18) becomes $\hat s_{p,\lambda}^{g,Z}(x)=-K_{xZ}K_{ZZ}^{-\frac12}(\mathbf L+\lambda I)^{-1}K_{ZZ}^{-\frac12}h=-K_{xZ}(K_{ZX}K_{XZ}+\lambda K_{ZZ})^{-1}h$. Similar to Example 3.5, we find this is exactly the NKEF estimator obtained in Sutherland et al. (2018, Theorem 1).
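As a concrete illustration of Example C.1, the following NumPy sketch evaluates $-K_{xZ}(K_{ZX}K_{XZ}+\lambda K_{ZZ})^{-1}h$ from precomputed kernel blocks. The stacked-block layout and the convention that $h$ collects $\hat\zeta$ at the Nyström points are our assumptions about the intended dimensions, not a reference implementation.

```python
import numpy as np

def nystrom_tikhonov_score(K_xZ, K_ZX, K_ZZ, h, lam):
    """Nystrom/Tikhonov score estimate of Example C.1.

    K_xZ : (d, N*d)    kernel block between the query point x and Z
    K_ZX : (N*d, M*d)  kernel block between Z and the full sample X
    K_ZZ : (N*d, N*d)  kernel block on Z
    h    : (N*d,)      stacked values of zeta_hat at the Nystrom points Z
    lam  : Tikhonov regularization parameter
    """
    A = K_ZX @ K_ZX.T + lam * K_ZZ          # K_ZX K_XZ + lam K_ZZ  (uses K_XZ = K_ZX^T)
    return -K_xZ @ np.linalg.solve(A, h)    # s_hat(x) = -K_xZ A^{-1} h
```

Only an $Nd\times Nd$ system is solved, matching the complexity reduction discussed in Sec. C.1.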

C.2. Computational Details

Details of Example 3.6. Using the notation in Example 3.6 and Sec. 2.2, we can reformulate SSGE into matrix form as follows:
$$\begin{aligned}
\hat g_i(x)&=-\sum_{j=1}^J\left(\frac1M\sum_{n=1}^M\partial_i\hat\psi_j(x^n)\right)\psi_j(x)\\
&=-\sum_{j=1}^J\frac1M\left(\frac{\sqrt M}{\lambda_j}\sum_{n,m=1}^M\partial_ik(x^n,x^m)w_j^{(m)}\right)\left(\frac{\sqrt M}{\lambda_j}\sum_{\ell=1}^Mk(x,x^\ell)w_j^{(\ell)}\right)\\
&=-\sum_{j=1}^J\frac1{\lambda_j^2}\left(\sum_{n,m=1}^M\partial_ik(x^n,x^m)w_j^{(m)}\right)\left(\sum_{\ell=1}^Mk(x,x^\ell)w_j^{(\ell)}\right)\\
&=-\sum_{\ell=1}^M\sum_{n,m=1}^Mk(x,x^\ell)\left(\sum_{j=1}^J\frac{w_j^{(m)}w_j^{(\ell)}}{\lambda_j^2}\right)\partial_ik(x^n,x^m)\\
&=-\sum_{\ell=1}^M\sum_{m=1}^Mk(x,x^\ell)\left(\sum_{j=1}^J\frac{w_j^{(m)}w_j^{(\ell)}}{\lambda_j^2}\right)\left(\sum_{n=1}^M\partial_ik(x^n,x^m)\right)\\
&=-k(x,X)\left(\sum_{j=1}^J\frac{w_jw_j^T}{\lambda_j^2}\right)r_i,
\end{aligned}$$

where $r_{i,j}=\sum_{n=1}^M\partial_ik(x^n,x^j)$, and $w_1,\cdots,w_M$ are the unit eigenvectors of $k(X,X)$ corresponding to the eigenvalues $\lambda_1\ge\cdots\ge\lambda_M$, with $w_j^{(m)}$ the $m$-th component of $w_j$. Note that when using diagonal kernels, we have $K(x,y)=k(x,y)\otimes I_d$; then the eigenvectors of $K(X,X)$ are $\{w_i\otimes e_j:i\in[M],j\in[d]\}$ and the eigenvalue corresponding to $w_i\otimes e_j$ is $\lambda_i$, where $\{e_j\}$ is the standard basis of $\mathbb R^d$. We also note that in this case

$$h_{(m-1)d+i}=\hat\zeta(x^m)_i=\frac1M\sum_{\ell=1}^M\left(\mathrm{div}_{x^\ell}K(x^\ell,x^m)\right)_i=\frac1M\sum_{\ell=1}^M\partial_ik(x^\ell,x^m)=\frac1M\,r_{i,m}.$$
Comparing with (12), we find that SSGE is equivalent to using diagonal kernels and the spectral cut-off regularization.
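The matrix form above translates directly into code. The sketch below is a minimal NumPy version; the helpers `kernel` and `grad_kernel_x1` (returning $\partial_ik(\cdot,\cdot)$ with respect to the first argument) are assumed to be supplied by the user, and the cut-off level $J$ plays the role of the spectral cut-off parameter.

```python
import numpy as np

def ssge_gradient(x, X, kernel, grad_kernel_x1, J):
    """SSGE in matrix form: g(x) = -k(x, X) (sum_j w_j w_j^T / lam_j^2) r.

    x                   : (d,)    query point
    X                   : (M, d)  samples
    kernel(A, B)        : Gram matrix k(A, B) of shape (len(A), len(B))
    grad_kernel_x1(A,B) : array of shape (len(A), len(B), d) with entries
                          d_i k(a^n, b^m) w.r.t. the first argument a^n
    J                   : number of eigenpairs kept
    """
    K = kernel(X, X)                            # (M, M)
    lam, W = np.linalg.eigh(K)                  # ascending eigenvalues
    lam, W = lam[::-1][:J], W[:, ::-1][:, :J]   # top-J eigenpairs, lam_1 >= ... >= lam_J
    R = grad_kernel_x1(X, X).sum(axis=0)        # R[m, i] = sum_n d_i k(x^n, x^m) = r_{i,m}
    kxX = kernel(x[None, :], X)[0]              # (M,)
    proj = (W / lam**2) @ W.T                   # sum_j w_j w_j^T / lam_j^2
    return -kxX @ proj @ R                      # (d,) estimated gradient at x
```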

Details of Example 3.7. For the regularizer $g_\lambda(\sigma):=(\lambda+\sigma)^{-1}\mathbf 1_{\{\sigma>0\}}$, from Lemma C.2 we know that when $K$ is non-singular, $\hat s_{p,\lambda}^g(x)=-K_{xX}K^{-1}(\frac1MK+\lambda I)^{-1}h$. Next, we consider the minimization problem in (6), ignore the one-dimensional subspace $\mathbb R\hat\zeta$ of the solution space, and assume the solution is $K_{xX}c$ as before. We can rewrite the objective in (6) as
$$\frac1Mc^TK^2c+\lambda c^TKc+2c^Th.$$

By taking the gradient, we find that $c$ satisfies $(\frac1MK^2+\lambda K)c=-h$, so this is equivalent to using the previously mentioned regularization.
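For completeness, a small sketch of this estimator, assuming a non-singular Gram matrix $K$ as in the discussion above; the stacked-block variable layout follows Theorem 3.1 and is our assumption.

```python
import numpy as np

def example_3_7_score(K_xX, K, h, lam, M):
    """Solve (K^2 / M + lam K) c = -h and return s_hat(x) = K_xX c,
    which equals -K_xX K^{-1} (K/M + lam I)^{-1} h when K is non-singular."""
    c = np.linalg.solve(K @ K / M + lam * K, -h)
    return K_xX @ c
```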

C.3. Curl-Free Kernels

Recover the Function From Its Gradient. Since every vector field in a curl-free RKHS is the gradient of some function, it is possible to recover this function from the vector field. Specifically, suppose the curl-free kernel is defined by $K_{cf}(x,y)=-\nabla^2\psi(x-y)$ and $f\in\mathcal H_{K_{cf}}$. Assume $f$ is of the following form:

$$f=\sum_{i=1}^mK_{cf}(x^i,\cdot)c_i=-\sum_{i=1}^m\sum_{j=1}^d\nabla\left(\partial_j\psi(x^i-\cdot)\right)c_i^{(j)}=\nabla\left(-\sum_{i=1}^m\sum_{j=1}^d\partial_j\psi(x^i-\cdot)c_i^{(j)}\right),$$

where $c_i^{(j)}$ is the $j$-th component of $c_i$. Thus we have found a function whose gradient is $f$.

The Special Structure of $K_{cf}(x,y)=-\nabla^2\phi(\|x-y\|)$. As we have mentioned in Sec. 3.5, curl-free kernels have some special structures. Suppose $K_{cf}$ is a curl-free kernel defined by $-\nabla^2\phi(r)$, where $\mathbf r=x-x'$ and $r=\|\mathbf r\|$. Then
$$\partial_{r_i}\phi=\phi'\frac{r_i}{r},\qquad\partial_{r_i}\nabla\phi=\phi''\frac{r_i}{r^2}\mathbf r+\phi'\frac{e_ir-r_i\frac{\mathbf r}{r}}{r^2},$$
where $e_i$ is the $i$-th column of the identity matrix. Then the curl-free kernel is of the form
$$K_{cf}(x,y)=\left(\frac{\phi'}{r^3}-\frac{\phi''}{r^2}\right)\mathbf r\mathbf r^T-\frac{\phi'}{r}I.\qquad(19)$$

We also obtain a divergence formula for such kernels. Note that

$$\partial_{jj}^2\partial_i\phi=\phi'''\frac{r_j^2r_i}{r^3}+\phi''\frac{(r_i+r_j\delta_{ij})r^2-2r_j^2r_i}{r^4}+\phi''\frac{r_j}{r}\frac{\delta_{ij}r-\frac{r_ir_j}{r}}{r^2}+\phi'\frac{(\delta_{ij}r_j-r_i)r^3-3rr_j(\delta_{ij}r^2-r_ir_j)}{r^6},$$
where $\delta_{ij}=[i=j]$. Next, we sum out $j$ and obtain
$$\left(\mathrm{div}_xK_{cf}(x,x')\right)_i=-\Delta(\partial_i\phi)(r)=-\frac{r_i}{r}\left(\phi'''(r)+\frac{d-1}{r}\left(\phi''(r)-\frac{\phi'(r)}{r}\right)\right).\qquad(20)$$

The Special Structure of $K_{cf}(x,y)=-\nabla^2\varphi(\|x-y\|^2)$. Since many frequently used kernels depend only on $\|x-y\|^2$, we also consider the structure of curl-free kernels of this type. Suppose $K_{cf}$ is a curl-free kernel defined by $-\nabla^2\varphi(r^2)$, where $\mathbf r=x-x'$ and $r=\|\mathbf r\|$. Then, using (19) and (20), we find

$$K_{cf}(x,y)=-4\varphi''\mathbf r\mathbf r^T-2\varphi'I,\qquad(21)$$
$$\mathrm{div}_xK_{cf}(x,y)=-4\left[(d+2)\varphi''+2r^2\varphi'''\right]\mathbf r.\qquad(22)$$
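Equations (21)-(22) make curl-free kernels cheap to assemble for radial $\varphi$. The sketch below instantiates them for the Gaussian $\varphi(t)=\exp(-t/(2\sigma^2))$; this particular choice of $\varphi$ and the helper name are illustrative assumptions.

```python
import numpy as np

def curlfree_gaussian_blocks(X, Y, sigma):
    """Curl-free kernel K_cf(x, y) = -4 phi'' r r^T - 2 phi' I and div_x K_cf
    (Eqs. (21)-(22)) with phi(t) = exp(-t / (2 sigma^2)).

    Returns
    K   : (Mx, My, d, d)  blocks K_cf(x^i, y^j)
    div : (Mx, My, d)     divergences div_x K_cf(x^i, y^j)
    """
    d = X.shape[1]
    R = X[:, None, :] - Y[None, :, :]                   # r = x - y, shape (Mx, My, d)
    r2 = np.sum(R**2, axis=-1, keepdims=True)           # ||r||^2
    phi = np.exp(-r2 / (2 * sigma**2))
    dphi, d2phi, d3phi = (-phi / (2 * sigma**2),        # phi', phi'', phi'''
                          phi / (4 * sigma**4),
                          -phi / (8 * sigma**6))
    K = (-4 * d2phi[..., None] * R[..., :, None] * R[..., None, :]
         - 2 * dphi[..., None] * np.eye(d))
    div = -4 * ((d + 2) * d2phi + 2 * r2 * d3phi) * R
    return K, div
```

With $X=Y$ set to the sample, $\hat\zeta(x^m)$ can be assembled as `div.mean(axis=0)`, and the blocks can be reshaped into the $Md\times Md$ Gram matrix used by the estimators in Sec. C.4.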

C.4. Details of Different Regularization Schemes

C.4.1. TIKHONOV REGULARIZATION

Proof of Theorem 3.1. When $g_\lambda(\sigma)=(\sigma+\lambda)^{-1}$, the estimator is $\hat s_{p,\lambda}=-(\hat L_K+\lambda I)^{-1}\hat\zeta$. We need to compute an explicit formula for the inverse of $\hat L_K+\lambda I$. Note that $-(\hat L_K+\lambda I)^{-1}\hat\zeta$ is the solution of the following minimization problem:

$$\hat s_{p,\lambda}^g=\arg\min_{s\in\mathcal H_K}\frac1M\sum_{i=1}^Ms(x^i)^Ts(x^i)+2\langle s,\hat\zeta\rangle_{\mathcal H}+\lambda\|s\|_{\mathcal H}^2.$$

From the general representer theorem (Sriperumbudur et al., 2017, Theorem A.2), the minimizer lies in the space generated by $\{K_{x^i}c:i\in[M],c\in\mathbb R^d\}\cup\{\hat\zeta\}$.

We can therefore assume
$$\hat s_{p,\lambda}^g=\sum_{i=1}^MK_{x^i}c_i+a\hat\zeta.$$

Define $c:=(c_1,\cdots,c_M)$ and $h:=(\hat\zeta(x^1),\cdots,\hat\zeta(x^M))$; then the optimization objective can be written as

$$\frac1M\left(c^TK^2c+2ac^TKh+a^2h^Th\right)+2\left(a\|\hat\zeta\|_{\mathcal H}^2+h^Tc\right)+\lambda\left(c^TKc+2ac^Th+a^2\|\hat\zeta\|_{\mathcal H}^2\right).$$

Taking the derivative, we need to solve the following linear system

$$\frac1M(K^2c+aKh)+h+\lambda(Kc+ah)=0,\qquad\frac1M(ah^Th+c^TKh)+(1+\lambda a)\|\hat\zeta\|_{\mathcal H}^2+\lambda c^Th=0.$$

By some calculations, this system is equivalent to a = −1/λ and (K + MλI)c = h/λ.
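The resulting estimator is therefore $\hat s_{p,\lambda}^g(x)=K_{xX}c-\hat\zeta(x)/\lambda$ with $c=(K+M\lambda I)^{-1}h/\lambda$. A minimal NumPy sketch follows; the stacked-block layout of $K$, $h$, and $K_{xX}$ is assumed as in Theorem 3.1.

```python
import numpy as np

def tikhonov_score(K_xX, zeta_x, K, h, lam, M):
    """Tikhonov estimator: a = -1/lam and (K + M lam I) c = h / lam.

    K_xX   : (d, M*d)    kernel block between the query x and the samples X
    zeta_x : (d,)        zeta_hat evaluated at x
    K      : (M*d, M*d)  Gram matrix on X
    h      : (M*d,)      stacked values of zeta_hat at X
    """
    c = np.linalg.solve(K + M * lam * np.eye(K.shape[0]), h / lam)
    return K_xX @ c - zeta_x / lam      # K_xX c + a * zeta_hat(x), with a = -1/lam
```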

C.4.2. SPECTRAL CUT-OFF REGULARIZATION

Proof of Lemma 3.2. Let $\mathcal H_0$ be the subspace of $\mathcal H_K$ generated by $\{K_{x^m}c:c\in\mathbb R^d,m\in[M]\}$. Note that $f(x^m)^Tc=\langle K(\cdot,x^m)c,f\rangle_{\mathcal H}=0$ for any $f\in\mathcal H_0^\perp$ and $c\in\mathbb R^d$, so $\hat L_K=0$ on $\mathcal H_0^\perp$. Also note that $\hat L_Kv\in\mathcal H_0$ and $v(x^m)=\sqrt{M\sigma}\,u^{(m)}$; then

$$\hat L_Kv(x^k)=\frac1M\sum_{m=1}^MK(x^k,x^m)v(x^m)=\frac1{\sqrt M}\sum_{m=1}^MK(x^k,x^m)\sqrt\sigma\,u^{(m)}=\sigma v(x^k),$$
and we conclude that $\hat L_Kv=\sigma v$. The following equation shows that $v$ is normalized:

$$\|v\|_{\mathcal H}^2=\frac1{\sqrt{M\sigma}}\sum_{m=1}^M\left\langle K(\cdot,x^m)u^{(m)},v\right\rangle_{\mathcal H}=\frac1{\sqrt{M\sigma}}\sum_{m=1}^M\left\langle u^{(m)},v(x^m)\right\rangle_{\mathbb R^d}=\sum_{m=1}^M(u^{(m)})^Tu^{(m)}=1.$$

Theorem 3.3 is a corollary of the following lemma, which provides a general form for regularizers $g_\lambda$ with $g_\lambda(0)=0$.

Lemma C.2. Let $g_\lambda:[0,\kappa^2]\to\mathbb R$ be a regularizer such that $g_\lambda(0)=0$. Let $(\sigma_j,u_j)_{j\ge1}$ be the non-zero eigenvalue and eigenvector pairs satisfying $\frac1MKu_j=\sigma_ju_j$. Then we have
$$g_\lambda(\hat L_K)\hat\zeta=K_{xX}\left(\sum_i\frac{g_\lambda(\sigma_i)}{M\sigma_i}u_iu_i^T\right)h,$$
where $K_{xX}$ and $h$ are defined as in Theorem 3.1.

Proof. Let $\{(\mu_i,v_i)\}$ be the pairs of non-zero eigenvalues and eigenfunctions of $\hat L_K:\mathcal H\to\mathcal H$; then by Lemma 3.2 we have $\sigma_i=\mu_i$. Note that
$$\hat L_K=\sum_i\mu_i\langle v_i,\cdot\rangle_{\mathcal H}v_i\qquad\text{and}\qquad g_\lambda(\hat L_K)=\sum_ig_\lambda(\mu_i)\langle v_i,\cdot\rangle_{\mathcal H}v_i.$$

From Lemma 3.2, we have

$$\begin{aligned}
g_\lambda(\hat L_K)\hat\zeta&=\sum_ig_\lambda(\sigma_i)\langle v_i,\hat\zeta\rangle_{\mathcal H}v_i\\
&=\sum_ig_\lambda(\sigma_i)\left\langle\frac1{\sqrt{M\sigma_i}}\sum_{j=1}^MK_{x^j}u_i^{(j)},\hat\zeta\right\rangle_{\mathcal H}\left(\frac1{\sqrt{M\sigma_i}}\sum_{k=1}^MK_{x^k}u_i^{(k)}\right)\\
&=\frac1M\sum_i\sum_{j,k=1}^Mg_\lambda(\sigma_i)\sigma_i^{-1}\left\langle K_{x^j}u_i^{(j)},\hat\zeta\right\rangle_{\mathcal H}K_{x^k}u_i^{(k)}\\
&=\frac1M\sum_i\sum_{j,k=1}^Mg_\lambda(\sigma_i)\sigma_i^{-1}\,\hat\zeta(x^j)^Tu_i^{(j)}\,K_{x^k}u_i^{(k)}\\
&=K_{xX}\left(\sum_i\frac{g_\lambda(\sigma_i)}{M\sigma_i}u_iu_i^T\right)h.
\end{aligned}$$
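Since the framework's estimator is $\hat s_{p,\lambda}^g=-g_\lambda(\hat L_K)\hat\zeta$, Lemma C.2 gives $\hat s_{p,\lambda}^g(x)=-K_{xX}\big(\sum_i\frac{g_\lambda(\sigma_i)}{M\sigma_i}u_iu_i^T\big)h$ for any regularizer with $g_\lambda(0)=0$. A minimal NumPy sketch, with the regularizer passed as a callable; the cut-off example in the comment is an assumption consistent with the spectral cut-off scheme.

```python
import numpy as np

def spectral_score(K_xX, K, h, M, g_lam):
    """s_hat(x) = -K_xX B h with B = sum_i g_lam(sigma_i) / (M sigma_i) u_i u_i^T,
    where (sigma_i, u_i) are the nonzero eigenpairs of K / M.
    Example (spectral cut-off): g_lam = lambda s: (s >= 0.1) / s
    """
    sigma, U = np.linalg.eigh(K / M)
    keep = sigma > 1e-12                      # keep only (numerically) nonzero eigenvalues
    sigma, U = sigma[keep], U[:, keep]
    B = (U * (g_lam(sigma) / (M * sigma))) @ U.T
    return -K_xX @ (B @ h)
```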

C.4.3. ITERATIVE REGULARIZATION

Theorem C.3 (Landweber iteration). Let $\hat s_{p,\lambda}^g$ be defined as in (8), and $g_\lambda(\sigma)=\eta\sum_{i=0}^{t-1}(1-\eta\sigma)^i$, where $t:=\lfloor\lambda^{-1}\rfloor$. Then we have
$$\hat s_{p,\lambda}^g(x)=-t\eta\hat\zeta(x)+K_{xX}c_t,$$
where $c_0=0$ and $c_{t+1}=(I-\eta K/M)c_t+t\eta^2h/M$, and $K_{xX}$ and $h$ are defined as in Theorem 3.1.

Proof. We note that the iteration process is

$$\hat s_p^{(1)}=-\eta\hat\zeta,\qquad\hat s_p^{(t)}=-\eta\hat\zeta+(I-\eta\hat L_K)\hat s_p^{(t-1)}=\hat s_p^{(t-1)}+\eta\left(-\hat\zeta-\hat L_K\hat s_p^{(t-1)}\right),$$

where we define $\hat s_p^{(t)}:=\hat s_{p,1/t}$. We can assume

$$\hat s_p^{(t)}=a_t\hat\zeta+K_{xX}c_t.$$

Then, by induction,
$$\hat s_p^{(t)}=-\eta\hat\zeta+(I-\eta\hat L_K)\left(a_{t-1}\hat\zeta+K_{xX}c_{t-1}\right)=(a_{t-1}-\eta)\hat\zeta+K_{xX}\left(c_{t-1}-\frac\eta M\left(a_{t-1}h+Kc_{t-1}\right)\right).$$

Thus, we have $a_t=-t\eta$ and $c_t=(I-\eta K/M)c_{t-1}+(t-1)\eta^2h/M$, with $c_1=0$.
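A minimal NumPy sketch of the Landweber estimator, running $t=\lfloor\lambda^{-1}\rfloor$ steps of the recursion derived above; the choice of step size $\eta$ (small enough that $\eta\|K\|/M\le1$) is an assumption.

```python
import numpy as np

def landweber_score(K_xX, zeta_x, K, h, eta, lam, M):
    """Landweber iteration: c_{t+1} = (I - eta K / M) c_t + t eta^2 h / M, c_0 = 0,
    and s_hat(x) = -t eta zeta_hat(x) + K_xX c_t."""
    t = int(np.floor(1.0 / lam))
    c = np.zeros_like(h)
    for s in range(t):                         # after the loop, c equals c_t
        c = c - eta * (K @ c) / M + s * eta**2 * h / M
    return -t * eta * zeta_x + K_xX @ c
```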

Before introducing the ν-method, we recall that iterative regularization can be represented by a family of polynomials $g_\lambda(\sigma)=\mathrm{poly}(\sigma)$, where $g_\lambda$ converges to the function $1/\sigma$ as $\lambda\to0$. For example, for the Landweber iteration we have

$$g_\lambda(\sigma)=\eta\sum_{i=0}^{t-1}(1-\eta\sigma)^i=\frac{1-(1-\eta\sigma)^t}{\sigma}.$$
We can verify that the identification of $\lambda$ with $t^{-1}$ satisfies Definition 4.1 of a regularization scheme. To see the qualification, note that the maximum of $|1-\sigma g_\lambda(\sigma)|\sigma^r=\sigma^r(1-\eta\sigma)^t$ over $[0,\eta^{-1}]$ is attained at $\sigma=r/(r\eta+t)$, and hence
$$\sup_{0\le\sigma\le\eta^{-1}}|1-\sigma g_\lambda(\sigma)|\sigma^r\le\frac{r^rt^t}{(r\eta+t)^{r+t}}\le\left(\frac rt\right)^r\le\max(r^r,1)\,\lambda^r.$$

Thus, we see that the qualification is $\infty$.

Example C.4 (ν-method). The ν-method (Engl et al., 1996) is an accelerated version of the Landweber iteration. The idea behind it is to find better polynomials $p_t(\sigma)$ of degree $t$ to approximate the function $1/\sigma$. These polynomials satisfy $\sup_{0\le\sigma\le1}|1-\sigma p_t(\sigma)|\sigma^\nu\le c_\nu t^{-2\nu}$. Compared with the definition of the qualification in Definition 4.1, we can identify $\lambda$ with $t^{-2}$. Thus, for the same regularization parameter, the ν-method requires only about $\lambda^{-1/2}$ iterations while the Landweber iteration requires about $\lambda^{-1}$ iterations. For more details about the construction of these polynomials, we refer the reader to Engl et al. (1996, Appendix A.1 and Section 6.3).

Below we give the algorithm of the ν-method, where $t=\lfloor\lambda^{-1/2}\rfloor$ and $\hat s_{p,\lambda}:=\hat s_p^{(t)}$.

$$\hat s_p^{(0)}=0,\qquad\hat s_p^{(1)}=-\omega_1\hat\zeta,\qquad\hat s_p^{(t)}=\hat s_p^{(t-1)}+u_t\left(\hat s_p^{(t-1)}-\hat s_p^{(t-2)}\right)+\omega_t\left(-\hat\zeta-\hat L_K\hat s_p^{(t-1)}\right),$$
where
$$u_t=\frac{(t-1)(2t-3)(2t+2\nu-1)}{(t+2\nu-1)(2t+4\nu-1)(2t+2\nu-3)},\qquad\omega_t=\frac{4(2t+2\nu-1)(t+\nu-1)}{(t+2\nu-1)(2t+4\nu-1)}.$$

Similarly, we can assume
$$\hat s_p^{(t)}=a_t\hat\zeta+K_{xX}c_t.$$

Then, by induction,

$$\begin{aligned}
\hat s_p^{(t)}&=\left(1+u_t-\omega_t\hat L_K\right)\hat s_p^{(t-1)}-u_t\hat s_p^{(t-2)}-\omega_t\hat\zeta\\
&=\left(1+u_t-\omega_t\hat L_K\right)\left(a_{t-1}\hat\zeta+K_{xX}c_{t-1}\right)-u_t\left(a_{t-2}\hat\zeta+K_{xX}c_{t-2}\right)-\omega_t\hat\zeta\\
&=\left((1+u_t)a_{t-1}-u_ta_{t-2}-\omega_t\right)\hat\zeta+K_{xX}\left((1+u_t)c_{t-1}-\frac{\omega_t}{M}\left(a_{t-1}h+Kc_{t-1}\right)-u_tc_{t-2}\right).
\end{aligned}$$

Thus, we obtain the iteration formula for at and ct as follows:

$$a_t:=(1+u_t)a_{t-1}-u_ta_{t-2}-\omega_t,\qquad c_t:=(1+u_t)c_{t-1}-\frac{\omega_t}{M}\left(a_{t-1}h+Kc_{t-1}\right)-u_tc_{t-2},$$
with $c_0=c_1=0$, $a_0=0$, and $a_1=-\omega_1$.
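A minimal NumPy sketch of the ν-method recursion for $(a_t,c_t)$; the starting value $\omega_1$ is taken from the general formula at $t=1$ (i.e., $(4\nu+2)/(4\nu+1)$, as in Engl et al. (1996)), which is an assumption since the text leaves it implicit.

```python
import numpy as np

def nu_method_score(K_xX, zeta_x, K, h, lam, M, nu=1.0):
    """nu-method: run t = floor(lam^{-1/2}) steps and return a_t zeta_hat(x) + K_xX c_t."""
    T = max(int(np.floor(lam ** -0.5)), 1)
    omega1 = (4 * nu + 2) / (4 * nu + 1)                  # general omega_t formula at t = 1
    a_prev, a_curr = 0.0, -omega1                         # a_0, a_1
    c_prev, c_curr = np.zeros_like(h), np.zeros_like(h)   # c_0, c_1
    for t in range(2, T + 1):
        u = ((t - 1) * (2 * t - 3) * (2 * t + 2 * nu - 1)
             / ((t + 2 * nu - 1) * (2 * t + 4 * nu - 1) * (2 * t + 2 * nu - 3)))
        w = (4 * (2 * t + 2 * nu - 1) * (t + nu - 1)
             / ((t + 2 * nu - 1) * (2 * t + 4 * nu - 1)))
        a_next = (1 + u) * a_curr - u * a_prev - w
        c_next = (1 + u) * c_curr - (w / M) * (a_curr * h + K @ c_curr) - u * c_prev
        a_prev, a_curr = a_curr, a_next
        c_prev, c_curr = c_curr, c_next
    return a_curr * zeta_x + K_xX @ c_curr
```

Compared with `landweber_score` above, only about $\lambda^{-1/2}$ matrix-vector products with $K$ are needed.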

D. Technical Results

Lemma D.1. Suppose Assumption B.5 holds. Then $L_K,\hat L_K:\mathcal H_K\to\mathcal H_K$ are positive, self-adjoint, trace-class operators. Moreover, $\mathrm{tr}\,L_K\le\kappa^2$ and $\mathrm{tr}\,\hat L_K\le\kappa^2$.

Proof. The result follows from a simple calculation. It is easy to see that $L_K$ and $\hat L_K$ are positive and self-adjoint. We prove that they are trace class. Let $\{\varphi_i\}$ be an orthonormal basis of $\mathcal H_K$ and $\{e_k\}$ be the standard basis of $\mathbb R^d$; then

$$\begin{aligned}
\mathrm{tr}\,L_K&=\sum_i\langle L_K\varphi_i,\varphi_i\rangle_{\mathcal H}=\int_{\mathcal X}\sum_i\langle K_x\varphi_i(x),\varphi_i\rangle_{\mathcal H}\,d\rho=\int_{\mathcal X}\sum_i\sum_{k=1}^d\left\langle\langle K_xe_k,\varphi_i\rangle_{\mathcal H}K_xe_k,\varphi_i\right\rangle_{\mathcal H}\,d\rho\\
&=\sum_{k=1}^d\int_{\mathcal X}\sum_i|\langle K_xe_k,\varphi_i\rangle_{\mathcal H}|^2\,d\rho=\sum_{k=1}^d\int_{\mathcal X}\|K_xe_k\|_{\mathcal H}^2\,d\rho=\int_{\mathcal X}\mathrm{tr}\,K(x,x)\,d\rho\le\kappa^2.
\end{aligned}$$

Similarly, we have $\mathrm{tr}\,\hat L_K\le\kappa^2$.

We need the following concentration inequality in Hilbert spaces, used in Bauer et al. (2007).

Lemma D.2 (Bauer et al. (2007), Proposition 23). Let $\xi$ be a random variable with values in a real Hilbert space $\mathcal H$. Assume there are two constants $\sigma,H$ such that
$$\mathbb E\left[\|\xi-\mathbb E\xi\|_{\mathcal H}^m\right]\le\frac12m!\,\sigma^2H^{m-2},\qquad\forall m\ge2.$$

Then, for all $n\in\mathbb N$ and $0<\delta<1$, the following inequality holds with probability at least $1-\delta$:
$$\|\hat\xi-\mathbb E\xi\|_{\mathcal H}\le2\left(\frac Hn+\frac\sigma{\sqrt n}\right)\log\frac2\delta,$$

where $\hat\xi=\frac1n\sum_{i=1}^n\xi_i$ and $\{\xi_i\}$ are independent copies of $\xi$.

Lemma D.3. Under Assumption B.4, for all $M\in\mathbb N$ and $0<\delta<1$, the following inequality holds with probability at least $1-\delta$:
$$\|\hat\zeta-\zeta\|_{\mathcal H}\le2\left(\frac KM+\frac\Sigma{\sqrt M}\right)\log\frac2\delta,\qquad(23)$$
where $\hat\zeta=\frac1M\sum_{m=1}^M\mathrm{div}_{x^m}K_{x^m}^T$ and $\{x^m\}$ is a set of i.i.d. samples from $\rho$.

Proof. Define an $\mathcal H_K$-valued random variable $\xi_x:=\mathrm{div}_xK_x^T$. It is easy to see that $\mathbb E_{x\sim\nu}[\xi_x]=-L_Ks_p=:\xi$. From Assumption B.4, we have for $m\ge2$,
$$\mathbb E_\nu\left[\|\xi_x-\xi\|_{\mathcal H}^m\right]\le m!\,K^m\,\mathbb E_\nu\left[\exp\left(\frac{\|\xi_x-\xi\|_{\mathcal H}}K\right)-\frac{\|\xi_x-\xi\|_{\mathcal H}}K-1\right]\le\frac12m!\,\Sigma^2K^{m-2}.$$

Note that $\hat\zeta=\frac1M\sum_{m=1}^M\xi_{x^m}$ and $\mathbb E_\nu\hat\zeta=\xi$. Then (23) follows from Lemma D.2.

Lemma D.4. Under Assumption B.5, for all $M\in\mathbb N$ and $0<\delta<1$, the following inequality holds with probability at least $1-\delta$:
$$\|\hat L_K-L_K\|_{HS}\le\frac{2\sqrt2\kappa^2}{\sqrt M}\log\frac2\delta.\qquad(24)$$

Proof. This is a direct consequence of Vito et al.(2005, Lemma 8) and Lemma D.1.

The following useful lemma is from De Vito et al. (2014, Lemma 7) and Sriperumbudur et al. (2017, Lemma 15).

Lemma D.5. Suppose $S$ and $T$ are two self-adjoint Hilbert-Schmidt operators on a separable Hilbert space $\mathcal H$ with spectra contained in the interval $[a,b]$. Given a Lipschitz function $r:[a,b]\to\mathbb R$ with Lipschitz constant $L_r$, we have

$$\|r(S)-r(T)\|_{HS}\le L_r\|S-T\|_{HS}.$$

E. Samples

Table 5. WAE samples on MNIST. (Image grid; rows: Stein, SSGE, SSM, NKEF, ν-method, KEF-CG; columns: d = 8, 32, 64, 128.)

Table 6. WAE samples on CelebA. (Image grid; rows: Stein, SSGE, SSM, NKEF, ν-method, KEF-CG; columns: d = 8, 32, 64, 128.)