Nonparametric Score

Yuhao Zhou 1 Jiaxin Shi 1 Jun Zhu 1

Abstract clude parametric score matching (Hyvarinen¨ , 2005; Sasaki et al., 2014; Song et al., 2019), its denoising variants as Estimating the score, i.e., the of log den- autoencoders (Vincent, 2011), nonparametric score match- sity function, from a set of samples generated by ing (Sriperumbudur et al., 2017; Sutherland et al., 2018), an unknown distribution is a fundamental task and kernel score estimators based on Stein’s methods (Li in inference and learning of probabilistic mod- & Turner, 2018; Shi et al., 2018). They have been success- els that involve flexible yet intractable densities. fully applied to applications such as estimating of Kernel estimators based on Stein’s methods or mutual information for representation learning (Wen et al., score matching have shown promise, however 2020), score-based generative modeling (Song & Ermon, their theoretical properties and relationships have 2019; Saremi & Hyvarinen, 2019), gradient-free adaptive not been fully-understood. We provide a unifying MCMC (Strathmann et al., 2015), learning implicit mod- view of these estimators under the framework of els (Warde-Farley & Bengio, 2016), and solving intractabil- regularized nonparametric regression. It allows ity in approximate inference algorithms (Sun et al., 2019). us to analyse existing estimators and construct new ones with desirable properties by choosing Recently, nonparametric score estimators are growing in different hypothesis spaces and regularizers. A popularity, mainly because they are flexible, have well- unified convergence analysis is provided for such studied statistical properties, and perform well when sam- estimators. Finally, we propose score estimators ples are very limited. Despite a common goal, they have based on iterative regularization that enjoy com- different motivations and expositions. For example, the putational benefits from curl-free kernels and fast work Sriperumbudur et al.(2017) is motivated from the convergence. density estimation perspective and the richness of kernel ex- ponential families (Canu & Smola, 2006; Fukumizu, 2009), where the is obtained by score matching. Li & 1. Introduction Turner(2018) and Shi et al.(2018) are mainly motivated by Stein’s methods. The solution of Li & Turner(2018) Intractability of density functions has long been a central gives the score prediction at sample points by minimizing challenge in probabilistic learning. This may arise from the kernelized Stein discrepancy (Chwialkowski et al., 2016; various situations such as training implicit models like Liu et al., 2016) and at an out-of-sample point by adding it GANs (Goodfellow et al., 2014), or marginalizing over a to the training data, while the estimator of Shi et al.(2018) non-conjugate hierarchical model, e.g., evaluating the out- is obtained by a spectral analysis in function space. put density of stochastic neural networks (Sun et al., 2019). In these situations, inference and learning often require eval- As these estimators are studied in different contexts, uating such intractable densities or optimizing an objective their relationships and theoretical properties are not fully- understood. In this paper, we provide a unifying view of arXiv:2005.10099v2 [stat.ML] 30 Jun 2020 that involves them. them under the regularized nonparametric regression frame- Among various solutions, one important family of meth- work. 
This framework allows us to construct new estimators ods are based on score estimation, which rely on a key with desirable properties, and to justify the consistency and step of estimating the score, i.e., the derivative of the improve the convergence rate of existing estimators. It also log density ∇x log p(x) from a set of samples drawn from allows us to clarify the relationships between these estima- some unknown probability density p. These methods in- tors. We show that they differ only in hypothesis spaces and regularization schemes. 1Dept. of Comp. Sci. & Tech., BNRist Center, Institute for AI, Tsinghua-Bosch ML Center, Tsinghua University. Correspondence Our contributions are both theoretical and algorithmic: to: J. Zhu .

Proceedings of the 37 th International Conference on Machine • We provide a unifying perspective of nonparametric Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by score estimators. We show that the major distinction of the author(s). the KEF estimator (Sriperumbudur et al., 2017) from Nonparametric Score Estimators

the other two estimators lies in the use of curl-free 2.1. Vector-Valued Learning kernels, while Li & Turner(2018) and Shi et al.(2018) Supervised vector-valued learning amounts to learning a differ mostly in regularization schemes, with the for- vector-valued function f : X → Y from a training set mer additionally ignores a one-dimensional subspace z z = {(xm, ym)} , where X ⊆ d, Y ⊆ q. Here we in the hypothesis space. We provide a unified conver- m∈[M] R R assume the training data is sampled from an unknown distri- gence analysis under the framework. bution ρ(x, y), which can be decomposed into ρ(y|x)ρX (x). • We justify the consistency of the Stein gradient es- A criterion for evaluating such an estimator is the mean 2 timator (Li & Turner, 2018), although the originally squared error (MSE) E(f) := Eρ(x,y) kf(x) − yk2. It proposed out-of-sample extension is heuristic and ex- is well-known that the conditional expectation fρ(x) := pensive. We provide a natural and principled out-of- Eρ(y|x)[y] minimizes E. In practice, we minimize the empir- sample extension derived from our framework. For 1 PM m m 2 ical error Ez(f) := M m=1 kf(x ) − y k2 in a certain both approaches we provide explicit convergence rates. hypothesis space F. However, the minimization problem • From the convergence analysis we also obtain the ex- is typically ill-posed for large F. Hence, it is convenient to plicit rate for Shi et al.(2018), which can be shown to consider the regularized problem: improve the error bound of Shi et al.(2018). 2 fz,λ := arg min Ez(f) + λ kfkF , (1) • Our results suggest favoring curl-free estimators in f∈F

high dimensions. To address the scalability challenge, where k·kF is the norm in F. In the vector-valued case, it we propose iterative score estimators by adopting the is typical to consider a vector-valued RKHS HK associated ν-method (Engl et al., 1996) as the regularizer. We with a matrix-valued kernel K as the hypothesis space. Then PM m show that the structure of curl-free kernels can further the estimator is fz,λ = m=1 Kxm c , where Kxm denotes accelerate such algorithms. Inspired by a similar idea, m m 1 the function K(x , ·). c solves the linear system ( M K + we propose a conjugate gradient solver of KEF that is 1 i j 1 M λI)c = M y with Kij = K(x , x ), c = (c , ··· , c ), y = significantly faster than previous approximations. (y1, ··· , yM ).

For convenience, we define the sampling operator Sx : Notation We always assume ρ is a probability mea- Mq 1 M HK → R as Sx(f) := (f(x ), ··· , f(x )). Its ad- ∗ Mq sure with probability density function p(x) supported S : → H hS (f), i Mq = joint x R K that satisfies x c R on X ⊂ d, and L2(X , ρ; d) is the Hilbert space ∗ Mq ∗ 1 M R R hf, Sx (c)iHK , ∀f ∈ HK, c ∈ R is Sx (c , ··· , c ) = d M m 1 ∗ 1 ∗ of all square integrable functions f : X → R with P m m=1 Kx c . Since ( M Sx Sx + λI)fz,λ = M Sx Kc + 1 inner product hf, giL2(X ,ρ; d) = Ex∼ρ[hf(x), g(x)i d ]. ∗ ∗ R R λSx c = M Sx y, the estimator now can be written as −1 We denote by h·, ·iρ and k·kρ the inner product and 1 ∗  1 ∗ f ,λ = S S + λI S y. In fact, if we consider 2 d z M x x M x the norm in L (X , ρ; ), respectively. We denote k 2 R the data-free limit of (1): arg min E(f) + λ kfk , as a scalar-valued kernel, and K as a matrix-valued f∈HK HK d×d the minimizer is unique when λ > 0 and is given by kernel K : X × X → R satisfying the following −1 fλ := (LK +λI) LKfρ, where LK : HK → HK is the in- conditions: (1) K(x, x0) = K(x0, x)T for any x, x0 ∈ X ; R Pm T tegral operator defined as LKf := X Kxf(x)dρX (Smale (2) i,j=1 ci K(xi, xj)cj ≥ 0 for any {xi} ⊂ X and 1 ∗ d & Zhou, 2007). It turns out that M Sx Sx is an empirical es- {ci} ⊂ R . We denote a vector-valued reproducing kernel ˆ 1 PM m 1 ∗ timate of LK: LKf := m=1 Kxm f(x ) = Sx Sxf. Hilbert space (RKHS) associated to K by HK, which is the M M m ˆ 1 ∗ P d It can also be shown that LKfρ = Sx y. Hence, we can closure of i=1 K(xi, ·)ci : xi ∈ X , ci ∈ R , m ∈ N M ˆ −1 ˆ under the norm induced by the inner product write fz,λ = (LK + λI) LKfρ. T hK(xi, ·)ci, K(xj, ·)sji := ci K(xi, xj)sj. We define As we have mentioned, the role of regularization is to deal Kx := K(x, ·) and [M] := {1, ··· ,M} for M ∈ Z+. For ˆ s×t with the ill-posedness. Specifically, LK is not always in- A1, ··· , An ∈ R , we use (A1, ··· , An) to represent ns×t vertible as it has finite rank and HK is usually of infinite a block matrix A ∈ R with A(i−1)s+j,k being dimension. Many regularization methods are studied in the the (j, k)-th component of Ai, and we similarly define T T T context of solving inverse problems (Engl et al., 1996) and [A1, ··· , An] := (A1 , ··· , An) . statistical learning theory (Bauer et al., 2007). The regular- ization method we presented in (1) is the famous Tikhonov 2. Background regularization, which belongs to a class of regularization techniques called spectral regularization (Bauer et al., 2007). In this section, we briefly introduce the nonparametric re- Specifically, spectral regularization corresponds to a family gression method of learning vector-valued functions (Bal- of estimators defined as dassarre et al., 2012). We also review existing kernel-based g ˆ ˆ approaches to score estimation. fz,λ := gλ(LK)LKfρ, Nonparametric Score Estimators

+ ˆ where gλ : R → R is a regularizer such that gλ(LK) due to the large linear system of size Md × Md. Suther- approximates the inverse of LˆK. Note that LˆK can be de- land et al.(2018) proposed to use the Nystr om¨ method to P composed into σi hei, ·i ei, where (σi, ei) is a pair of accelerate KEF. Instead of minimizing the loss in the whole eigenvalue and eigenfunction, we can define gλ(LˆK) := RKHS, they minimized it in a low dimensional subspace. P gλ(σi) hei, ·i ei. The Tikhonov regularization corre- sponds to g (σ) = (λ + σ)−1. There are several different λ Stein Gradient Estimator The Stein gradient estimator regularizers. For example, the spectral cut-off regularizer proposed by Li & Turner(2018) is based on inverting the is defined by g (σ) = σ−1 for σ ≥ λ and g (σ) = 0 other- λ λ following generalized Stein’s identity (Stein, 1981; Gorham wise. We refer the readers to Smale & Zhou(2007); Bauer & Mackey, 2015) et al.(2007); Baldassarre et al.(2012) for more details.

T 2.2. Related Work Ep[h(x)∇ log p(x) + ∇h(x)] = 0, (4) We assume log p(x) is differentiable, and define the score where h : X → d is a test function satisfying some regu- as s := ∇ log p. By score estimation we aim to estimate R p larity conditions. An empirical approximation of the iden- s from a set of i.i.d. samples {xm} drawn from ρ. p m∈[M] tity is − 1 HS ≈ ∇ h, where H = (h(x1), ··· , h(xM )) ∈ There have been many kernel-based score estimators studied M x Rd×M , S = (∇ log p(x1), ··· , ∇ log p(xM )) ∈ RM×d and in different contexts (Sriperumbudur et al., 2017; Sutherland 1 PM m ∇ h = ∇ m h(x ). Li & Turner(2018) pro- et al., 2018; Li & Turner, 2018; Shi et al., 2018). Below we x M m=1 x 1 2 η 2 give a brief review of them. posed to minimize ∇xh + M HS F + M 2 kSkF to esti- mate S, where k·kF denotes the Frobenius norm. The kernel i j i T j Kernel Exponential Family Estimator The kernel ex- trick k(x , x ) := h(x ) h(x ) was then exploited to obtain ponential family (KEF) (Canu & Smola, 2006; Fukumizu, the estimator. From the above we only have score estimates 2009) was originally proposed as an infinite-dimensional at the sample points. Li & Turner(2018) proposed a heuris- m generalization of exponential families. It was shown to be tic out-of-sample extension at x by adding it to {x } and useful in density estimation as it can approximate a broad recompute the minimizer. Such an approach is unjustified. class of densities arbitrarily well (Sriperumbudur et al., It is still unclear whether the estimator is consistent. 2017). The KEF is defined as: f(x)−A(f) A(f) Pk := {pf (x) = e : f ∈ Hk, e < ∞}, Spectral Stein Gradient Estimator The Spectral Stein Gradient Estimator (SSGE) (Shi et al., 2018) was derived by where Hk is a scalar-valued RKHS, and A(f) := a spectral analysis of the score function. Unlike Li & Turner log R ef(x)dx A(f) X is the normalizing constant. Since is (2018) it was shown to have convergence guarantees and typically intractable, Sriperumbudur et al.(2017) proposed principled out-of-sample extension. The idea is to expand to estimate f by matching the model score ∇ log pf and the each component of the score in a scalar-valued function data score sp, thus the KEF can naturally be used for score 2 P∞ space L (X , ρ): gi(x) = βijψj(x), where gi is the estimation (Strathmann et al., 2015). This approach works j=1 i-th component of the score and {ψj} are the eigenfunc- by minimizing the regularized score matching loss: R tions of the operator Lkf := X k(x, ·)f(x)dρ(x) min J(pkp ) + λ kfk2 , (2) associated with a scalar-valued kernel k. By using the f Hk f∈Hk Stein’s identity in (4) Shi et al.(2018) showed that 2 βij = − ρ[∂iψj(x)]. The Nystrom¨ method (Baker, 1977; where J(pkq) := Ep k∇ log p − ∇ log qk2 is the Fisher di- E vergence between p and q. Integration by parts was used to Williams & Seeger, 2001) was then used to estimate ψj: eliminated ∇ log p from J(pkq) (Hyvarinen¨ , 2005) and the √ M exact solution of (2) was given as follows (Sriperumbudur M X ψˆ (x) = k(x, xm)w , (5) et al., 2017, Theorem 5): j λ jm j m=1 M d X X ξˆ fˆ = c ∂ k(xm, ·) − , (3) p,λ (m−1)d+j j λ where {xm} are i.i.d. samples drawn from ρ, w is m=1 j=1 m∈[M] jm the m-th component of the eigenvector that corresponds to ˆ 1 PM Pd 2 m Md where ξ(x) = M m=1 j=1 ∂j k(x , ·), c ∈ R its j-th largest eigenvalue of the kernel matrix constructed m is obtained by solving (G + MλI)c = b/λ with from {x }m∈[M]. 
The final estimator was obtained by m ` J G(m−1)d+i,(`−1)d+j = ∂i∂j+dk(x , x ) and b(m−1)d+i = P ˆ truncating gi to j=1 βijψj(x) and plugging in ψj. Shi 1 PM Pd 2 m ` M `=1 j=1 ∂i∂j+dk(x , x ), where ∂j+d denotes tak- et al.(2018, Theorem 2) provided an error bound of SSGE ing derivative w.r.t. the j-th component of the second param- depending on J and M. However, the convergence rate is eter x`. This solution suffers from computational drawbacks still unknown. Nonparametric Score Estimators

g 3. Nonparametric Score Estimators Theorem 3.1 (Tikhonov Regularization). Let sˆp,λ be de- fined as in (8), and g (σ) = (σ + λ)−1. Then The kernel score estimators discussed in Sec. 2.2 were pro- λ posed in different contexts. The KEF estimator is motivated g ˆ sˆ (x) = KxXc − ζ(x)/λ, (9) from the density estimation perspective, while Stein and p,λ SSGE have no explicit density models. SSGE relies on where c is obtained by solving spectral analysis in the function space, while the other two are derived by minimizing a loss function. Despite sharing a (K + MλI)c = h/λ. (10) common goal, it is still unclear how these estimators relate Here c ∈ Md, h = (ζˆ(x1), ··· , ζˆ(xM )) ∈ Md, K = to each other. In this section, we present a unifying frame- R R xX [K(x, x1), ··· , K(x, xM )] ∈ d×Md, and K ∈ Md×Md is work of score estimation using regularized vector-valued R R given by K = K(xm, x`) . regression. We show that several existing kernel score esti- (m−1)d+i,(`−1)d+j ij mators are special cases under the framework, which allows The proof is given in appendix C.4, where the general repre- us to thoroughly investigate their strengths and weaknesses. senter theorem (Sriperumbudur et al., 2017, Theorem A.2) is used to show that the solution lies in the subspace generated 3.1. A Unifying Framework by d ˆ {Kxm cm : m ∈ [M], cm ∈ R } ∪ {ζ}. (11) As introduced in Sec. 2.2, the goal is to estimate the score sp from a set of i.i.d. samples {xm} drawn from ρ. We m∈[M] Unlike the Tikhonov regularizer that shifts all eigenval- first consider the ideal case where we have the ground truth ues simultaneously, the spectral cut-off regularization sets values of s at the sample locations. Then we can estimate p g (σ) = σ−1 for σ ≥ λ, and g (σ) = 0 otherwise. To s with vector-valued regression as described in Sec. 2.1: λ λ p obtain such estimator, we need the following lemma that M relates the spectral properties of K and LˆK. 1 X λ sˆ = arg min ks( m) − s ( m)k2 + ksk2 . 1 p,λ x p x 2 HK Lemma 3.2. Let σ be a non-zero eigenvalue of K such s∈H M 2 M K m=1 that 1 Ku = σu, where u ∈ Md is the unit eigenvector. (6) M R −1 Then σ is an eigenvalue of LˆK and the corresponding unit The solution is given by sˆp,λ = (LˆK + λI) LˆKsp. We could replace the Tikhonov regularizer with other spectral eigenfunction is regularization, for which the general solution is M 1 X (m) v = √ Kxm u , sˆg := g (Lˆ )Lˆ s . p,λ λ K K p (7) Mσ m=1

1:M (1) (M) (i) d In reality, the values of sp at x are unknown and we can- where u is splitted into (u , ··· , u ) and u ∈ R . ˆ 1 PM m not compute LKsp as M m=1 Kxm sp(x ). Fortunately, we could exploit integration by parts to avoid this problem. The lemma is a direct generalization of Rosasco et al.(2010, Under some mild regularity conditions (Assumptions B.1- Proposition 9) to vector-valued operators. g B.3), we have Theorem 3.3 (Spectral Cut-Off Regularization). Let sˆp,λ be defined as in (8), and L s = [K ∇ log p(x)] = − [div KT], K p Eρ x Eρ x x ( σ−1 σ > λ, T where the divergence of K is defined as a vector-valued gλ(σ) = x 0 σ ≤ λ. function, whose i-th component is the divergence of the T i-th column of K . The empirical estimate LˆKsp is then x Let (σj, uj)j≥1 be the eigenvalue and eigenvector pairs that 1 PM T available as − divxm K m , which leads to the fol- 1 M m=1 x satisfy M Kuj = σjuj. Then we have lowing general formula of nonparametric score estimators:   T g ˆ ujuj sˆ = −gλ(LˆK)ζ, (8) g X p,λ sˆp,λ(x) = −KxX  2  h, (12) Mσj σj ≥λ ˆ 1 PM T where ζ := M m=1 divxm Kxm . where KxX and h are defined as in Theorem 3.1. 3.2. Regularization Schemes Apart from the above methods with closed-form solutions, We now derive the final form of the estimator under three early stopping of iterative solvers like gradient descent can regularization schemes (Bauer et al., 2007). The choice also play the role of regularization (Engl et al., 1996). Iter- of regularization will impact the convergence rate of the ative methods replace the expensive inversion or eigende- estimator, which will be studied in Sec.4. composition of the Md × Md size kernel matrix with fast Nonparametric Score Estimators matrix-vector multiplication. In Sec. 3.5 we show that such eigendecomposition is the same as in the scalar-valued case. methods can be further accelerated by utilizing the structure On the other hand, the independence assumption may not of our kernel matrix. hold for score functions, whose output dimensions are cor- related as they form the gradient of the log density. As we We consider two iterative methods: the Landweber itera- shall see in Sec.5, such misspecification of the hypothesis tion and the ν-method (Engl et al., 1996). The Landweber ˆ space degrades the performance in high dimensions. iteration solves LˆKsp = −ζ with the fixed-point iteration: Curl-Free Kernels Noticing that score vector fields are (t+1) (t) ˆ ˆ (t) sˆp :=s ˆp − η ζ + LKsˆp , (13) gradient fields, we can use curl-free kernels (Fuselier Jr, 2007; Macedoˆ & Castro, 2010) to capture this property. where η is a step-size parameter. It can be regarded as using A curl-free kernel can be constructed from the negative the following regularization: Hessian of a translation-invariant kernel k(x, y) = φ(x − y): (k) 2 2 g Kcf (x, y) := −∇ φ(x − y), where φ : X → ∈ C . It Theorem 3.4 (Landweber Iteration). Let sˆp,λ and sˆp be R is easy to see that K is positive definite. A nice property defined as in (8) and (13), respectively. Let sˆ(0) = 0 and cf Pt−1 i −1 of HKcf is that any element in it is a gradient of some gλ(σ) = η i=0(1 − ησ) , where t := bλ c. Then, function. To see this, notice that the j-th column of Kcf is g (t) ˆ −∇(∂jφ) and each element in HKcf is a linear combination sˆp,λ =s ˆp = −tηζ + KxXct, of columns of Kcf . We also note that the unnormalized log- 2 where c0 = 0, ct+1 = (Id − ηK/M)ct − tη h/M, and density function can be recovered from the estimated score K, KxX, h are defined as in Theorem 3.1. 
when using curl-free kernels (see appendix C.3). The cost of inversion and eigendecomposition of the kernel matrix is The Landweber iteration often requires a large number of O(M 3d3), compared to O(M 3) for diagonal kernels. iterations. An accelerated version of it is the ν-method, where ν is a parameter controlling the maximal convergence 3.4. Examples rate (see Sec.4). The regularizer of the ν-method can be In the following we provide examples of nonparametric represented by a family of polynomials gλ(σ) = poly(σ). These polynomials approximate 1/σ better than those in score estimators derived from the framework. We show that the Landweber iteration. As a result, the ν-method only existing estimators can be recovered with certain types of −1/2 kernels and regularization schemes (Table1). requires a polynomial of degree bλ c to define gλ, which significantly reduces the number of iterations (Engl et al., Example 3.5 (KEF). Consider using curl-free kernels for 1996; Bauer et al., 2007). The next iterate of the ν-method the Tikhonov regularized estimator in (9). By substituting 2 can be generated by the current and the previous ones: Kcf (x, y) = −∇ φ(x − y) for K, we get (t+1) (t) (t) (t−1) ˆ ˆ (t) sˆp =s ˆp + ut(ˆsp − sˆp ) − ωt(ζ + LKsˆp ), M d ˆ X X ζcf (x) sˆg (x) = − c ∇∂ φ(x − xm) − , p,λ (m−1)d+j j λ where ut, ωt are carefully chosen constants (Engl et al., m=1 j=1 1996, Algorithm 6.13). We describe the full algorithm in Example C.4 (appendix C.4.3). 1 PM Pd 2 m where ζcf (x)i := − M m=1 j=1 ∂i∂j φ(x−x ). Notic- ing that Kcf (x, y)ij = −∂i∂jφ(x − y) = ∂i∂j+dk(x, y), we 3.3. Hypothesis Spaces could check that the definition of c here, which follows from (9), is the same as in (3). Thus by comparing with (3), In this framework, the hypothesis space is characterized by we have the matrix-valued kernel that induces the RKHS (Alvarez et al., 2012). Below we discuss two choices of the kernel: g ˆ sˆ (x) = ∇fp,λ(x) = ∇ log p ˆ (x). (14) the diagonal ones are computationally more efficient, while p,λ fp,λ curl-free kernels capture the conservative property of score Therefore, the KEF estimator is equivalent to choosing curl- vector fields. free kernels and the Tikhonov regularization in (8). Diagonal Kernels The simplest way to define a diago- We note that, although the solutions are equivalent, the nal matrix-valued kernel is K(x, y) = k(x, y)Id, where space {∇ log p , f ∈ H } looks different from the curl-free k : X × X → R is a scalar-valued kernel. This induces f k d d RKHS constructed from the negative Hessian of k. Such a product RKHS Hk := ⊗i=1Hk where all output dimen- sions of a function are independent. In this case the kernel equivalence of regularized minimization problems may be 1 M of independent interest. matrix for X = (x ,..., x ) is K = k(X, X) ⊗ Id, where k(X, X) denotes the Gram matrix of the scalar-valued kernel Example 3.6 (SSGE). For the estimator (12) obtained k. Therefore, the computational cost of matrix inversion and from the spectral cut-off regularization. Consider letting Nonparametric Score Estimators

K(x, y) = k(x, y)Id. Then K = k(X, X) ⊗ Id, and it can be Table 1. Existing nonparametric score estimators, their kernel PM Pd T T types, and regularization schemes. φ is from k(x, y) = φ(x − y). decomposed as m=1 i=1 λm(wmwm ⊗ eiei ), where {(λm, wm)} is the eigenpairs of k(X, X) with λ1 ≥ λ2 ≥ d ALGORITHM KERNEL REGULARIZER · · · ≥ λM and {ei} is the standard basis of . The estima- R −1 SSGE k(x, y)Id 1{σ≥λ}σ tor reduces to −1 Stein k(x, y)Id 1{σ>0}(λ + σ)   2 −1 T KEF −∇ φ(x − y)(λ + σ) wjw 2 −1 g X j NKEF −∇ φ(x − y) 1{σ>0}(λ + σ) sˆp,λ(x)i = −k(x, X)  2  ri, (15) λj λj ≥λ Interestingly, we get further acceleration by utilizing the 1 where M ri := (hi, hd+i, ··· , h(M−1)d+i). When we structure of curl-free kernels. choose λ = λJ , simple calculations (see appendix C.2) g Example 3.8 (Iterative curl-free estimators). We observe show that (15) equals the SSGE estimator sˆp,λ(x)i = that when using a curl-free kernel Kcf constructed from a 1 PJ PM ˆ m ˆ ˆ − M j=1 m=1 ∂iψj(x )ψj(x), where ψj is defined as radial scalar-valued kernel k(x, y) = φ(kx − yk), in (5). Therefore, SSGE is equivalent to choosing the di-  0 00  0 agonal kernel K( , ) = k( , ) and the spectral cut-off φ φ T φ x y x y Id Kcf (x, y) = − rr − I, regularization in (8). r3 r2 r

Example 3.7 (Stein). We consider modifying the Tikhonov where r = x − y, r = krk2. Consider in matrix-vector −1 d regularizer to gλ(σ) = (λ + σ) 1{σ>0}. In this case, we multiplications, for a vector a ∈ R , Kcf (x, y)a can be g −1 1 −1  φ0 φ00  T φ0 obtain an estimator sˆp,λ(x) = −KxXK ( M K + λI) h computed as r3 − r2 (r a)r− r a, where only a vector- by Lemma C.2. At sample points, the estimated score is vector multiplication is required with time complexity O(d), 1 −1 2 2 −( M K + λI) h, which coincides with the Stein gradient compared to general O(d ). Thus, we only need O(M d) estimator. This suggests a principled out-of-sample exten- Md time to compute Kb for any b ∈ R . In practice, we sion of the Stein gradient estimator. only need to store samples for computing Kb instead of To gain more insights, we consider to minimize (6) in the constructing the whole kernel matrix. This reduces the d memory usage from O(M 2d2) to O(M 2d). subspace generated by {Kxm cm : m ∈ [M], cm ∈ R }. Compared with (11), the one-dimensional subspace ζˆ is R We note that the same idea in Example 3.8 can be used ignored. We could check that (in appendix C.2) this is to accelerate the KEF estimator if we adopt the conjugate equivalent to exploiting the previous mentioned regular- gradient methods (Van Loan & Golub, 1983) to solve (10), izer. Therefore, the Stein estimator is equivalent to using because we have shown that the KEF estimator is equiva- the diagonal kernel K(x, y) = k(x, y)I and the Tikhonov d lent to our Tikhonov regularized estimators with curl-free regularization with a one-dimensional subspace ignored. kernels. As we shall see in experiments, this method is All the above examples can be extended to use a subset of extremely fast in high dimensions. the samples with Nystrom¨ methods (Williams & Seeger, 2001). Specifically, we can modify the general formula in 4. Theoretical Properties g,Z ˆ ˆ (8) as sˆp,λ = −gλ(PZLKPZ)PZζ, where PZ : HK → HK is the projection onto a low-dimensional subspace gener- In this section, we give a general theorem on the conver- ated by the subset Z. When the curl-free kernel and the gence rate of score estimators in our framework, which same truncated Tikhonov regularizer as in Example 3.7 provides a tighter error bound of SSGE (Shi et al., 2018). are used, this estimator is equivalent to the Nystrom¨ KEF We also investigate the case where samples are corrupted by (NKEF) (Sutherland et al., 2018). More details can be found a small set of points, and provide the convergence rate of the in appendix C.1. heuristic out-of-sample extension proposed in Li & Turner (2018). Proofs and assumptions are deferred to appendixB. 3.5. Scalability First, we follow Bauer et al.(2007); Baldassarre et al.(2012) When using curl-free kernels, we need to deal with an to characterize the regularizer. Md × Md matrix. In such cases, the Tikhonov and the Definition 4.1 (Bauer et al.(2007)) . We say a family 3 3 2 2 spectral cut-off regularization cost O(M d ) and have dif- of functions gλ : [0, κ ] → R, 0 < λ ≤ κ is ficulties scaling with the sample size and the input dimen- a regularizer if there are constants B, D, γ such that sions. Fortunately, as the unifying perspective suggests, sup0<σ≤κ2 |σgλ(σ)| ≤ D, sup0<σ≤κ2 |gλ(σ)| ≤ B/λ and we could modify the regularization schemes with iterative sup0<σ≤κ2 |1−σgλ(σ)| ≤ γ. 
The qualification of gλ is the r r methods that only require matrix-vector multiplications, e.g., maximal r such that sup0<σ≤κ2 |1 − σgλ(σ)|σ ≤ γrλ , the Landweber iteration and the ν-method (see Sec. 3.2). where γr does not depend on λ. Nonparametric Score Estimators

Now, we can use the idea of Bauer et al.(2007, Theorem Next, we consider the case where estimators are not obtained 10) to obtain an error bound of our estimator. from i.i.d. samples. Specifically, we consider how the Theorem 4.2. Assume Assumptions B.1-B.5 hold. Let r¯ be convergence rate is affected when our data is the mixture of g a set of i.i.d. samples and a set of fixed points. the qualification of the regularizer gλ, and sˆp,λ be defined as r Theorem 4.6. Under the same assumption of Theorem 4.2, in (8). Suppose there exists f0 ∈ HK such that sp = LKf0 − 1 −1 for some r ∈ [0, r¯]. Then we have for λ = M 2r+2 , we define gλ(σ) := (λ + σ) , and choose Z := n m {z }n∈[N] ⊆ X . Let Y := {y }m∈[M] be a set of i.i.d. g r − 2r+2  samples drawn from ρ, and sˆp,λ,Z be defined as in (8) with ksˆp,λ − spkHK = Op M , X = Z ∪ Y. Suppose N = O(M α), then we have for − 1 and for r ∈ [0, r¯ − 1/2], we have λ = M 2r+2 , − r  α− r r+1/2 2r+2 r+1 g   sup ksˆp,λ,Z − spkHK = Op M + O(M ), − 2r+2 ksˆp,λ − spkρ = Op M , Z n where the supZ is taken over all {z }n∈[N] ⊆ X . where Op is the Big-O notation in probability. Proof Outline. Define T := 1 S∗S , where S is the Note the qualification impacts the maximal convergence Z N Z Z Z rate. As the qualification of Tikhonov regularization is 1, sampling operator. Let sˆp,λ be defined as in (8) with ˆ from the error bound, we observe the well-known saturation X = Y. We can write the estimator as sˆp,λ,Z := gλ(LK + ˆ N ˆ phenomenon of Tikhonov regularization (Engl et al., 1996), RZ)(LK + RZ)sp, where RZ := M+N (TZ − LK), and ˆ ˆ ˆ i.e., the convergence rate does not improve even if sp = bound ksˆp,λ,Z−sˆp,λk by k(gλ(LK+RZ)−gλ(LK))LKspk+ r ˆ LKf0 for r > 1. To alleviate this, we can choose the kgλ(LK + RZ)RZspk. It can be shown that the first term is regularizer with a larger qualification. For example, the O NM −1λ−2, and the second term is O NM −1λ−1. spectral cut-off regularization and the Landweber iteration Combining these with Theorem 4.2, we finish the proof. have qualification ∞, and the ν-method has qualification ν. This suggests that the ν-method is appealing as it has From Theorem 4.6, we see that the convergence rate is r a smaller iteration number than the Landweber iteration not affected when data is corrupted by at most O(M 2r+2 ) and a better maximal convergence rate than the Tikhonov points. Under the same notation of this theorem, the out-of- regularization. sample extension of the Stein estimator proposed in Li & Remark 4.3 (Stein). The consistency and convergence rate Turner(2018) can be written as sˆp,λ,x(x), which corrupts of the Stein estimator and its out-of-sample extension sug- the i.i.d. data by a single test point. Then we can obtain the gested in Example 3.7 follow from Theorem 4.2. The rate following bound for this estimator. −θ1  in k·kρ is Op M , where θ1 ∈ [1/4, 1/3]. The conver- Corollary 4.7. With the same assumptions and notations of gence rate of the original out-of-sample extension in Li & Theorem 4.6, we have

Turner(2018) will be given in Corollary 4.7. r − 2r+2 sup ksˆp,λ,x(x) − sp(x)k2 = Op(M ). Remark 4.4 (SSGE). From Theorem 4.2, the convergence x∈X −θ2 rate in k·kρ of SSGE is Op(M ), where θ2 ∈ [1/4, 1/2), which improves Shi et al.(2018, Theorem 2). To see this, 5. Experiments we assume the eigenvalues of LK are µ1 > µ2 > ··· and −β they decay as µJ = O(J ). The error bound provided by We evaluate our estimators on both synthetic and real data. Shi et al.(2018) is In Sec. 5.1, we consider a challenging grid distribution as described in the experiment of Sutherland et al.(2018) to  2  2 J test the accuracy of nonparametric score estimators in high ksˆp,λ − spkρ = Op 2 + µJ . µJ (µJ − µJ+1) M dimensions and out-of-sample points, In Sec. 5.2 we train Wasserstein autoencoders (WAE) with score estimation and 1 4(β+1) compare the accuracy and the efficiency of different estima- We can choose J = M to obtain ksˆp,λ − spkρ = 1 − β tors. We mainly compare the following score estimators : Op(M 8(β+1) ). The convergence rate is slower than −1/4 Op(M ), the worst case of Theorem 4.2. Existing nonparametric estimators: Stein (Li & Turner, 2018), SSGE (Shi et al., 2018), KEF (Sriperumbudur et al., Remark 4.5 (KEF). Compared with Theorem 7(ii) in Sripe- 2017), and its low rank approximation NKEF (Sutherland rumbudur et al.(2017), where they bound the Fisher diver- α et al., 2018), where α represents to use αM/10 samples. gence, which is the square of the L2-norm in our Theo- rem 4.2, we see that the two results are exactly the same. 1Code is available at https://github.com/miskcoo/ −θ3  The rate in this norm is Op M , where θ3 ∈ [1/4, 1/3]. kscore. Nonparametric Score Estimators

Stein

e Stein 0.8 e 0.35 Table 2. Negative log-likelihoods on MNIST datasets and per c c

n SSGE

n SSGE a a t KEF-CG t 0.30 KEF-CG s epoch time on 128 latent dimension. All models are timed on s i i

d 0.6 NKEF d NKEF 4 0.25 4 2 2 ℓ ℓ

GeForce GTX TITAN X GPU. ν-Method ν-Method d d 0.20 e 0.4 e z z i i l l a

a 0.15 m m LATENT DIM 8 32 64 128 TIME r r 0.2

o 0.10 o n n 0.05 STEIN 97.15 92.10 101.60 114.41 4.2S 50 100 150 200 50 100 150 200 S dimension dimension SSGE 97.24 92.24 101.92 114.57 9.2 KEF 97.07 90.93 91.58 92.40 201.1S (a) M = 128 (b) M = 512 NKEF2 97.71 92.29 92.82 94.14 36.4S

0 NKEF4 97.59 91.19 91.80 92.94 97.5S 10 Stein 101 Stein e e c c

n SSGE n SSGE NKEF8 97.23 90.86 92.39 92.49 301.2S a a

t KEF-CG t KEF-CG s s i i

d −1 NKEF d NKEF 10 4 4 2 2 0 KEF-CG 97.39 90.77 92.66 92.05 13.7S ℓ ℓ 10 ν-Method ν-Method d d

e e ν-METHOD 97.28 90.94 91.48 92.10 78.1S z z i i l l a 10−2 a m m r r −1 SSM 96.98 89.06 93.06 96.92 6.0S o o 10 n n

100 200 300 400 100 200 300 400 samples samples is lower than that of curl-free kernels. This suggests favoring (c) d = 16 (d) d = 128 the diagonal kernels in low dimensions. Possibly because 2 this dataset does not make the convergence rate saturate, Figure 1. Normalized distance E[ksp − sˆp,λk2]/d on grid data. In the first row, M is fixed and d varies. In the second row, d is fixed we find different regularization schemes produce similar and M varies. Shaded areas are three times the standard deviation. results. The iterative score estimator based on the ν-method is among the best and KEF-CG closely tracked them even 350 35 with large d and M. 300 30 n i 250 m

25 λ / x

a 200

20 m 5.2. Wasserstein Autoencoders λ 150 15

Time per Epoch (s) 100 10 Wasserstein autoencoder (WAE) (Tolstikhin et al., 2017) 200 400 600 800 1000 50 100 150 200 250 Latent Dim Latent Dim is a latent variable model p(z, x) with observed and latent variables x ∈ X and z ∈ Z, respectively. p(z, x) is defined (a) Computational Cost (b) λmax/λmin by a prior p(z) and a distribution of z conditioned on x, −5 Figure 2. (a) Computational costs of KEF-CG for λ = 10 on and can be written as p(z, x) = p(z)pθ(x|z). WAEs aim at MNIST; (b) The ratio of the maximum and the minimum eigenval- minimizing Wasserstein distance Wc(pX , pG) between the R ues of kernel matrices. data distribution pX (x) and pG(x) := p(z, x)dz, where c is a metric on X . Tolstikhin et al.(2017) showed that Parametric estimators: In the WAE experiment, we also when pθ(x|z) maps z to x deterministically by a function consider the sliced score matching (SSM) estimator (Song G : Z → X , it suffices to minimize EpX (x)Eqφ(z|x)[kx − et al., 2019), which is a parametric method and requires 2 G(z)k2] + λD(qφ(z), p(z)), where D is a divergence of amortized training. two distributions and qφ(z|x) is a parametric approximation of the posterior. When we choose D to be the KL diver- Proposed: The iterative curl-free estimator with the ν- R method, and the conjugate gradient version of the KEF gence, the entropy term of qφ(z) := qφ(z|x)pX (x)dx in estimator (KEF-CG), both described in Sec. 3.5. the loss function is intractable (Song et al., 2019). If z can be parameterized by fφ(x) with x ∼ pX , the gradient of the entropy can be estimated using score estimators as 5.1. Synthetic Distributions EpX (x)[∇z log qφ(z)∇φfφ(x)] (Shi et al., 2018; Song et al., We follow Sutherland et al.(2018, Sec. 5.1) to construct 2019). a d-dimensional grid distribution. It is the mixture of d We train WAEs on MNIST and CelebA and repeat each con- standard Gaussian distributions centered at d fixed vertices figuration 3 times. The average negative log-likelihoods for in the unit hypercube. We change d and M respectively to MNIST estimated by AIS (Neal, 2001) are reported in Ta- test the accuracy and the convergence of score estimators, ble2. The results for CelebA are reported in appendixA. We and use 1024 samples from the grid distribution to evaluate can see that the performance of these estimators is close in the `2. We report the result of 32 runs in Fig.1. low latent dimensions, and the parametric method is slightly We can see that the effect of hypothesis space is significant. better than nonparametric ones as we have continuously gen- The diagonal kernels used in SSGE and Stein degrade the erated samples. However, in high dimensions, estimators accuracy in high dimensions, while curl-free kernels provide based on curl-free kernels significantly outperform those better performance. In low dimensions, all estimators are based on diagonal kernels and parametric methods. This is comparable, and the computational cost of diagonal kernels probably due to guarantee that the estimates at all locations Nonparametric Score Estimators form a gradient field. References As discussed in Sec. 3.5, curl-free kernels are computation- Alvarez, M. A., Rosasco, L., Lawrence, N. D., et al. Kernels ally expensive. This is shown in Table2 by the running for vector-valued functions: A review. Foundations and time of the original KEF algorithm. By comparing the time Trends R in Machine Learning, 4(3):195–266, 2012. 
and performance of NKEFα with α = 2, 4, 8, we see that in order to get meaningful speed-up in high dimensions, Baker, C. T. The numerical treatment of integral equations. low-rank approximation methods have to sacrifice the per- 1977. formance, which are outperformed by the iterative curl-free Baldassarre, L., Rosasco, L., Barla, A., and Verri, A. Multi- estimators based on the ν-method. KEF-CG is the fastest output learning via spectral filtering. Machine learning, curl-free method in high dimensions while the performance 87(3):259–301, 2012. is comparable with the original KEF. Fig. 2a shows the training time of KEF-CG in different latent dimensions. Sur- Bauer, F., Pereverzev, S., and Rosasco, L. On regularization prisingly, the speed rapidly increases with increasing latent algorithms in learning theory. Journal of complexity, 23 dimension and then flattens out. The convergence rate of (1):52–72, 2007. conjugate gradient is determined by the condition number, Canu, S. and Smola, A. Kernel methods and the exponential which means the kernel matrix K becomes well-conditioned family. Neurocomputing, 69(7-9):714–720, 2006. in high dimensions (see Fig. 2b). We found with large d, SSGE required at least 97% eigen- Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel values to attain the reported likelihood. We also ran SSGE test of goodness of fit. In International Conference on with curl-free kernels and found only 13% eigenvalues are Machine Learning, pp. 2606–2615, 2016. required to attain a comparable result when d = 8. From De Vito, E., Rosasco, L., and Toigo, A. Learning sets with these observations, a possible reason why diagonal kernels separating kernels. Applied and Computational Harmonic degrade the performance in high dimensions is that the dis- Analysis, 37(2):185–217, 2014. tribution is complicated while the hypothesis set is simple, so the small number of eigenfunctions are insufficient to ap- Engl, H. W., Hanke, M., and Neubauer, A. Regularization proximate the target. This can also be observed from Fig.1, of inverse problems, volume 375. Springer Science & where the performance of diagonal kernels and curl-free Business Media, 1996. kernels are closer as M increases since more eigenfunctions Fukumizu, K. Exponential manifold by reproducing kernel are provided. hilbert spaces. Algebraic and Geometric mothods in , pp. 291–306, 2009. 6. Conclusion Fuselier Jr, E. J. Refined error estimates for matrix-valued Our contributions are two folds. Theoretically, we present a radial basis functions. PhD thesis, Texas A&M Univer- unifying view of nonparametric score estimators, and clarify sity, 2007. the relationships of existing estimators. Under this perspec- tive, we provide a unified convergence analysis of existing Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., estimators, which improves existing error bounds. Practi- Warde-Farley, D., Ozair, S., Courville, A., and Bengio, cally, we propose an iterative curl-free estimator with nice Y. Generative adversarial nets. In Advances in Neural theoretical properties and computational benefits, and de- Information Processing Systems, pp. 2672–2680, 2014. velop a fast conjugate gradient solver for the KEF estimator. Gorham, J. and Mackey, L. Measuring sample quality with stein’s method. In Advances in Neural Information Pro- Acknowledgements cessing Systems, pp. 226–234, 2015. We thank Ziyu Wang for reading an earlier version of this Hyvarinen,¨ A. 
Estimation of non-normalized statistical manuscript and providing valuable feedback. This work was models by score matching. Journal of Machine Learning supported by the National Key Research and Development Research, 6(Apr):695–709, 2005. Program of China (No. 2017YFA0700904), NSFC Projects (Nos. 61620106010, U19B2034, U1811461), Beijing NSF Li, Y. and Turner, R. E. Gradient estimators for implicit Project (No. L172037), Beijing Academy of Artificial Intel- models. In International Conference on Learning Repre- ligence (BAAI), Tsinghua-Huawei Joint Research Program, sentations, 2018. Tsinghua Institute for Guo Qiang, and the NVIDIA NVAIL Liu, Q., Lee, J., and Jordan, M. A kernelized stein discrep- Program with GPU/DGX Acceleration. JS was also sup- ancy for goodness-of-fit tests. In International Confer- ported by a Microsoft Research Asia Fellowship. ence on Machine Learning, pp. 276–284, 2016. Nonparametric Score Estimators

Macedo,ˆ I. and Castro, R. Learning divergence-free and Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, curl-free vector fields with matrix-valued kernels. IMPA, B. Wasserstein auto-encoders. arXiv preprint 2010. arXiv:1711.01558, 2017. Neal, R. M. Annealed importance sampling. Statistics and Van Loan, C. F. and Golub, G. H. Matrix computations. computing, 11(2):125–139, 2001. Johns Hopkins University Press, 1983. Rosasco, L., Belkin, M., and Vito, E. D. On learning with in- Vincent, P. A connection between score matching and de- tegral operators. Journal of Machine Learning Research, noising autoencoders. Neural computation, 23(7):1661– 11(Feb):905–934, 2010. 1674, 2011. Saremi, S. and Hyvarinen, A. Neural empirical bayes. Jour- Vito, E. D., Rosasco, L., Caponnetto, A., Giovannini, U. D., nal of Machine Learning Research, 20:1–23, 2019. and Odone, F. Learning from examples as an inverse Sasaki, H., Hyvarinen,¨ A., and Sugiyama, M. Clustering problem. Journal of Machine Learning Research, 6(May): via mode seeking by direct estimation of the gradient of 883–904, 2005. Joint European Conference on Machine a log-density. In Warde-Farley, D. and Bengio, Y. Improving generative Learning and Knowledge Discovery in Databases , pp. adversarial networks with denoising feature matching. 19–34. Springer, 2014. International Conference on Learning Representations, Shi, J., Sun, S., and Zhu, J. A spectral approach to gradient 2016. estimation for implicit distributions. In International Wen, L., Zhou, Y., He, L., Zhou, M., and Xu, Z. Mutual in- Conference on Machine Learning, pp. 4651–4660, 2018. formation gradient estimation for representation learning. Smale, S. and Zhou, D.-X. Learning theory estimates via In International Conference on Learning Representations, integral operators and their approximations. Constructive 2020. approximation, 26(2):153–172, 2007. Williams, C. K. and Seeger, M. Using the nystrom¨ method Song, Y. and Ermon, S. Generative modeling by estimating to speed up kernel machines. In Advances in Neural gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 682–688, 2001. Information Processing Systems, pp. 11895–11907, 2019. Song, Y., Garg, S., Shi, J., and Ermon, S. Sliced score match- ing: A scalable approach to density and score estimation. arXiv preprint arXiv:1905.07088, 2019. Sriperumbudur, B., Fukumizu, K., Gretton, A., Hyvarinen,¨ A., and Kumar, R. Density estimation in infinite dimen- sional exponential families. The Journal of Machine Learning Research, 18(1):1830–1888, 2017. Stein, C. M. Estimation of the mean of a multivariate . The annals of Statistics, pp. 1135–1151, 1981. Strathmann, H., Sejdinovic, D., Livingstone, S., Szabo, Z., and Gretton, A. Gradient-free hamiltonian monte carlo with efficient kernel exponential families. In Advances in Neural Information Processing Systems, pp. 955–963, 2015. Sun, S., Zhang, G., Shi, J., and Grosse, R. Functional variational Bayesian Neural Networks. In International Conference on Learning Representations, 2019. Sutherland, D., Strathmann, H., Arbel, M., and Gretton, A. Efficient and principled score estimation with nystrom¨ kernel exponential families. In International Confer- ence on Artificial Intelligence and Statistics, pp. 652–660, 2018. Nonparametric Score Estimators

In appendixA we provide additional details and further results of experiments. In appendixB, we list the assumptions we used, and prove the non-asymptotic verison of Theorem 4.2 and 4.6. In appendixC, we give the details of Sec.3, including deriving algorithms presented in Sec. 3.2, examples in Sec. 3.4 and a general formula for curl-free kernels. AppendixD includes some technical results used in proofs. Finally, We present samples drawn from trained WAEs in appendixE.

A. Experiment Details and Additional Results

2 2 −1/2 In experiments, we use the IMQ kernel k(x, y) := (1 + kx − yk2/σ ) and its curl-free version in corresponding kernel estimators. We use the median of the pairwise Euclidean distances between samples as the kernel bandwidth. The parameter ν of the ν-method is set to 1. The maximum iteration number of KEF-CG is 40 and the convergence tolerance of it is 10−4.

A.1. Grid Distributions We use αM eigenvalues in SSGE with α searched in {0.99, 0.97, 0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4}. We search the number of iteration steps of the ν-method in {20, 30, 40, 50, 60, 70, 80, 90, 100}. We search the regularization coefficient λ of Stein, NKEF, KEF-CG in {10−k : k = 0, 1, ··· , 8}. The experiments are repeated 32 times.

A.2. Wasserstein Autoencoders

2 We use the standard Gaussian distribution N (0,I) as the prior p(z), and N (µφ(x), σφ(x)) as the approximated posterior qφ(z|x), and Bernoulli(Gθ(z)) as the generator pθ(x|z). We use minibatch size 64. Models are optimized by the Adam optimizer with learning rate 10−4. Each configuration is repeated 3 times, and the mean and the standard deviation are reported in Table3 and Table4. All models are timed on GeForce GTX TITAN X GPU.

Table 3. Negative log-likelihoods on the MNIST dataset and per epoch time on 128 latent dimension.

LATENT DIM 8 32 64 128 TIME STEIN 97.15 ± 0.14 92.10 ± 0.07 101.60 ± 0.44 114.41 ± 0.25 4.2S SSGE 97.24 ± 0.07 92.24 ± 0.17 101.92 ± 0.08 114.57 ± 0.23 9.2S KEF 97.07 ± 0.03 90.93 ± 0.23 91.58 ± 0.03 92.40 ± 0.34 201.1S NKEF2 97.71 ± 0.24 92.29 ± 0.41 92.82 ± 0.18 94.14 ± 0.69 36.4S NKEF4 97.59 ± 0.15 91.19 ± 0.08 91.80 ± 0.12 92.94 ± 0.58 97.5S NKEF8 97.23 ± 0.06 90.86 ± 0.09 92.39 ± 1.32 92.49 ± 0.41 301.2S KEF-CG 97.39 ± 0.22 90.77 ± 0.12 92.66 ± 0.67 92.05 ± 0.06 13.7S ν-METHOD 97.28 ± 0.17 90.94 ± 0.02 91.48 ± 0.09 92.10 ± 0.06 78.1S SSM 96.98 ± 0.27 89.06 ± 0.01 93.06 ± 0.68 96.92 ± 0.08 6.0S

Table 4. Frechet´ Inception Distances on the CelebA dataset and per epoch time on 128 latent dimension.

LATENT DIM 8 32 64 128 TIME STEIN 73.85 ± 1.39 58.29 ± 0.46 57.54 ± 0.57 76.31 ± 1.33 164.4S SSGE 72.49 ± 1.09 58.01 ± 0.60 58.39 ± 1.00 76.85 ± 1.12 172.2S NKEF2 75.12 ± 1.55 53.92 ± 0.29 51.16 ± 0.30 55.17 ± 0.43 244.7S NKEF4 73.15 ± 0.77 54.54 ± 1.02 50.76 ± 0.19 53.70 ± 0.10 412.5S KEF-CG 72.92 ± 0.60 54.32 ± 0.31 50.44 ± 0.20 50.66 ± 0.89 166.2S ν-METHOD 72.02 ± 1.22 52.86 ± 0.20 50.16 ± 0.23 52.80 ± 0.43 220.9S SSM 69.72 ± 0.25 49.93 ± 0.74 72.68 ± 1.75 94.07 ± 3.57 163.3S

2 MNIST We parameterize µφ, σφ and Gθ(z) by fully-connected neural networks with two hidden layers, both of which con- sist of 256 units activated by ReLU. For SSM, the score is parameterized by a fully-connected neural network with two hidden layers consisting of 256 units activated by tanh. The regularization coefficients of Stein, KEF, NKEF, KEF-CG are searched in {10−k : k = 2, 3, ··· , 7} for the best log-likelihood, and the number of iteration steps of the ν-method are searched in {50, 70, ··· , 150}, and we use αM eigenvalues in SSGE with α searched in {0.99, 0.97, 0.95, 0.93, 0.91, 0.89, 0.87}. We run 1000 epoches and evaluate the model by AIS (Neal, 2001), where the parameters are the same as in SSGE. Specifically, Nonparametric Score Estimators we set the step size of HMC to 10−6, and the leapfrog step to 10. We use 5 chains and set the temperature to 103.

2 CelebA We parameterize µφ, Gθ by convolutional neural networks similar to Song et al.(2019). σφ is set to 1. For SSM, we use the same network as in MNIST to parameterize the score. The regularization coefficients of Stein, KEF, NKEF, KEF-CG are searched in {10−k : k = 2, 3, ··· , 7} for the best log-likelihood, and the number of iteration steps of the ν-method are searched in {20, 30, 40, 50, 60, 70}, and we use αM eigenvalues in SSGE with α searched in {0.99, 0.97, 0.95, 0.93, 0.91, 0.89, 0.87}. We run 100 epoches and evaluate the model using the Frechet´ Inception Distance (FID). As KEF and NKEF8 are slow, we do not compare them in this dataset. Results are reported in Table4.

B. Error Bounds

In the following, we suppress the dependence of HK on K for simplicity. We use k·kHS to denote the Hilbert-Schmidt norm of operators. The assumptions required in obtaining an error bound are listed below. Assumption B.1. X is a non-empty open subset of Rd, with piecewise C1 boundary. Assumption B.2. p, log p and each element of K are continuously differentiable. p and its total derivative Dp : X → Rd can both be continuously extended to X¯, where X¯ is the closure of X . Each element of K and its total derivative can be continuously extended to X¯ × X¯. p 1−d Assumption B.3. For all i, j ∈ [d], K(x, x)ijp(x) = 0 on ∂X , and |K(x, x)ij|p(x) = o(kxk2 ) as x → ∞, where ∂X := X\X¯ . T R Assumption B.4. Define an HK-valued random variable ξx := divx Kx , let ξ := X ξxdρ. There are two constants Σ,K, such that Z     2 kξx − ξkH kξx − ξkH Σ exp − − 1 dρ ≤ 2 . X K K 2K 2 Assumption B.5. There is a constant κ > 0 such that supx∈X tr K(x, x) ≤ κ . Assumptions B.1-B.3 are similar to those in Sriperumbudur et al.(2017). They guarantee the integration by parts is valid, so T we can obtain Eρ[Kx∇ log p] = −Eρ[divx Kx ]. Assumptions B.4 and B.5 come from Bauer et al.(2007), and are used in the concentration inequalities. Note that Assumption B.4 can be replaced by a stronger one that kξx − ξkH is uniformly bounded on X . We follows the idea of Bauer et al.(2007, Theorem 10) to prove Theorem 4.2. The non-asymptotic version is given as follows g Theorem B.1. Assume Assumptions B.1-B.5 hold. Let r¯ be the qualification of the regularizer gλ, and sˆp,λ be defined as in (8). Suppose there exists f ∈ H such that s = Lr f , for some r ∈ [0, r¯]. Then for any 0 < δ < 1, √ 0 K p K 0 2 2r+2 − 1 M ≥ (2 2κ log(4/δ)) r , choosing λ = M 2r+2 , the following inequalities hold with probability at least 1 − δ

− r 4 ksˆ − s k ≤ C M 2r+2 log , p,λ p H 1 δ and for r ∈ [0, r¯ − 1/2], we have − 2r+1 4 ksˆ − s k ≤ C M 4r+4 log , p,λ p ρ 2 δ √ √ 2 2 3 where C1 = 2B(K + Σ) + 2 2Bκ kspkH + (γr + κ γcr)kf0kH, and C2 = 2B(K + Σ)κ + 2 2Bκ kspkH + ((γr + 2 2 κ γ 1 cr) + c 1 (γr + κ γcr))kf0kH, and cr is a constant depending on r. Op is the Big-O notation in probability. 2 2

Proof. We consider the following decomposition ˆ ksˆp,λ − spkH ≤ kgλ(LˆK)(ζ − ζ)kH + kgλ(LˆK)LKsp − spkH ˆ ≤ kgλ(LˆK)(ζ − ζ)kH + kgλ(LˆK)(LK − LˆK)spkH + krλ(LˆK)spkH, where rλ(σ) := gλ(σ)σ − 1. By Definition 4.1, we have kgλ(LˆK)k ≤ B/λ. From Lemma D.3 and D.4, with probability at least 1 − δ, we have √ 2B(K + Σ) + 2 2Bκ2 ks k ˆ p H 4 kgλ(LˆK)(ζ − ζ)kH + kgλ(LˆK)(LK − LˆK)spkH ≤ √ log . λ M δ Nonparametric Score Estimators

ˆ r r ˆ By Definition 4.1, krλ(LK)LKk ≤ γrλ and krλ(LK)k ≤ γ, then

ˆ ˆ ˆr ˆ r ˆr krλ(LK)spkH ≤ krλ(LK)LKf0kH + krλ(LK)(LK − LK)f0kH r r ˆr ≤ γrλ kf0kH + γkLK − LKk kf0kH .

r ˆr ˆ r When r ∈ [0, 1], from Bauer et al.(2007√, Theorem 1), there exists a constant cr such that kLK − LKk ≤ crkLK − LKk . Then by Lemma D.4, and choose λ ≥ 2 2κ2M −1/2 log(4/δ), we have

√ r 2 ! r r 2 2κ 4 r kL − Lˆ k ≤ cr √ log ≤ crλ . K K M δ

Collecting the above results,   A1 r 4 ksˆp,λ − spkH ≤ √ + A2λ log , λ M δ − 1 where A1,A2 are constants which do not depend on λ and M. Then, we can choose λ = M 2r+2 to obtain the bound. √ r √ Combining with λ ≥ 2 2κ2M −1/2 log(4/δ), we require M 2r+2 ≥ 2 2κ2 log(4/δ). 0 r ˆr 0 ˆ r ˆr When√ r > 1, from Lemma D.5, there exists a constant cr such that kLK −LKkHS ≤ crkLK −LKkHS. Then kLK −LKkHS ≤ 0 2 −1/2 2 2crκ M log(4/δ), and a similar discussion can be applied to obtain the bound. √ Note that ksˆp,λ − spkρ = k LK(ˆsp,λ − sp)kH. Then we can apply the above discussion to obtain the bound for k·kρ.

Next, we give the non-asymptotic version of Theorem 4.6 as follows −1 n Theorem B.2. Under the same assumption of Theorem B.1, we define gλ(σ) := (λ+σ) , and choose Z := {z }n∈[N] ⊆ X . Let Y := {ym} be a set of i.i.d. samples drawn from ρ. Let sˆ be defined as in (8) with X = Z ∪ Y. Suppose m∈[M] √ p,λ,Z α 2 2r+2 − 1 N = M , then for any 0 < δ < 1, M ≥ (2 2κ log(4/δ)) r , choosing λ = M 2r+2 , the following inequalities hold with probability at least 1 − δ

− r 4 α− r sup ksˆp,λ,Z − spkH ≤ C1M 2r+2 log + C3M r+1 , Z δ

2 2 n where C3 := 2(κ + 1) kspkH, and the supZ is taken over all {z }n∈[N] ⊂ X . r In particular, when α = 2r+2 , we have

− r 4 sup ksˆp,λ,Z − spkH ≤ (C1 + C3)M 2r+2 log . Z δ

1 ∗ 1 N ˆ Proof. We define TZ := N SZSZ, where SZf := (f(z ), ··· , f(z )) is the sampling operator. Let LK := TY and sˆp,λ be ˆ ˆ N ˆ the estimator obtained from Y. Then we can write sˆp,λ,Z := gλ(LK + RZ)(LK + RZ)sp, where RZ := M+N (TZ − LK). We can bound the error as follows

ksˆp,λ,Z − spkH ≤ ksˆp,λ,Z − sˆp,λkH + ksˆp,λ − spkH ˆ ˆ ˆ ˆ ≤ k(gλ(LK + RZ) − gλ(LK))LKspkH + kgλ(LK + RZ)RZspkH + ksˆp,λ − spkH.

−1 The last term has been bounded by Theorem B.1, and we consider the first two terms. Since gλ(σ) = (λ+σ) is Lipschitz in ˆ ˆ 2 ˆ [0, ∞), from Lemma D.5, we have kgλ(LK +RZ)−gλ(LK)kHS ≤ kRZkHS/λ . Note kgλ(LK +RZ)RZkHS ≤ kRZkHS/λ, we obtain κ2 1  κ2 1  2κ2N ksˆ − sˆ k ≤ + kR k ks k ≤ + ks k p,λ,Z p,λ H λ2 λ Z HS p H λ2 λ M + N p H 2 2 2(κ + 1) N 2 2 α− r ≤ ks k = 2(κ + 1) M r+1 ks k . λ2M p H p H Combining with Theorem B.1, and noticing that the right hand does not depend on Z, we obtain the final bound. Nonparametric Score Estimators

Finally, we prove the error bound of the Stein estimator with its original out-of-sample extension.

Proof of Corollary 4.7. The Stein estimator at point x ∈ X can be written as

d X sˆp,λ,x(x) = hKxei, sˆp,λ,xiHei, i=1

d where {ei} is the standard basis of R . Note that

d X 2 sup ksˆp,λ,x(x) − sp(x)k2 ≤ sup |hKxei, sˆp,λ,x − spiH| ≤ κ sup ksˆp,λ,x − spkH. x∈X i=1 x∈X x∈X Then, the bound of Stein estimator immediately follows from Theorem 4.6.

C. Details in Section3 C.1. A General Version of Nystrom¨ KEF In this section, we briefly review the Nystrom¨ version of KEF (NKEF, Sutherland et al.(2018)) and give a more general version of it in our framework. One of the drawbacks of KEF, as we have mentioned before, is the high computational complexity. It requires to solve an Md × Md linear system, where M is the sample size and d is the dimension. Note that the solution of KEF in (3) lies in m ˆ the subspace generated by {∂ik(x , ·): i ∈ [d], m ∈ [M]} ∪ {ξ}. The Nystrom¨ version of KEF consider to minimize the n n loss (2) in a smaller subspace generated by {∂ik(z , ·): i ∈ [d], n ∈ [N]}, where N  M and {z } is a subset randomly sampled from {xm}. Sutherland et al.(2018) showed that it suffices to solve an Nd × Nd linear system, which reduces the computational complexity, while the convergence rate remains the same as that of KEF if N = Ω(M θ log M), where θ ∈ [1/3, 1/2].

In our framework, we can also consider finding the estimator in a smaller subspace. Let $\mathcal H_Z$ be the subspace generated by $\{z^n\}_{n\in[N]}$, i.e., $\mathcal H_Z:=\mathrm{span}\{K_{z^n}c:n\in[N],c\in\mathbb R^d\}$. Consider the following minimization problem, a modification of (6) in which the solution is sought in $\mathcal H_Z$:

$$\hat s_{p,\lambda}^Z=\arg\min_{s\in\mathcal H_Z}\frac1M\sum_{m=1}^M\|s(x^m)-s_p(x^m)\|_2^2+\lambda\|s\|_{\mathcal H_K}^2.\qquad(16)$$

The solution can be written as $\hat s_{p,\lambda}^Z=-(P_Z\hat L_KP_Z+\lambda I)^{-1}P_Z\hat\zeta$, where $\hat\zeta$ and $\hat L_K$ are defined as in Sec. 3.1 and $P_Z:\mathcal H_K\to\mathcal H_K$ is the projection operator onto $\mathcal H_Z$, defined as

$$P_Zf:=\arg\min_{g\in\mathcal H_Z}\|g-f\|_{\mathcal H}^2=S_Z^*(S_ZS_Z^*)^{-1}S_Zf,$$
where $S_Z$ and $S_Z^*$ are the sampling operator and its adjoint, respectively. This motivates us to define the Nyström version of our score estimators for general regularization schemes as follows:
$$\hat s_{p,\lambda}^{g,Z}:=-g_\lambda(P_Z\hat L_KP_Z)P_Z\hat\zeta.\qquad(17)$$
To obtain the matrix form of (17), we first introduce two operators:
$$L:=P_Z\hat L_KP_Z,\qquad\mathbf L:=K_{ZZ}^{-\frac12}K_{ZX}K_{XZ}K_{ZZ}^{-\frac12}.$$
We want to connect the spectral decompositions of $L$ and $\mathbf L$ as in Lemma 3.2. Suppose the spectral decomposition of $\mathbf L$ is $\sum_{i=1}^{Nd}\sigma_iu_iu_i^T$, where $\|u_i\|_{\mathbb R^{Nd}}=1$. Consider $v_i:=S_Z^*K_{ZZ}^{-\frac12}u_i$; we can verify that
$$\|v_i\|_{\mathcal H}^2=(K_{ZZ}^{-\frac12}u_i)^TK_{ZZ}K_{ZZ}^{-\frac12}u_i=u_i^Tu_i=1,\qquad Lv_i=S_Z^*K_{ZZ}^{-1}K_{ZX}K_{XZ}K_{ZZ}^{-\frac12}u_i=S_Z^*K_{ZZ}^{-\frac12}\mathbf Lu_i=\sigma_iv_i.$$
Thus, $L=\sum_{i=1}^{Nd}\sigma_i\langle v_i,\cdot\rangle_{\mathcal H}v_i$ is the spectral decomposition of $L$. The estimator can then be written as
$$\hat s_{p,\lambda}^{g,Z}=-S_Z^*K_{ZZ}^{-\frac12}\left(\sum_{i=1}^{Nd}g_\lambda(\sigma_i)u_iu_i^T\right)K_{ZZ}^{-\frac12}h=-S_Z^*K_{ZZ}^{-\frac12}g_\lambda(\mathbf L)K_{ZZ}^{-\frac12}h.\qquad(18)$$
The above estimator only involves smaller matrices. However, it requires some expensive matrix manipulations, such as the matrix square root, for general regularization schemes. Fortunately, these expensive terms cancel when using Tikhonov regularization:

Example C.1. When we use Tikhonov regularization $g_\lambda(\sigma)=(\sigma+\lambda)^{-1}$ and curl-free kernels, the score estimator (18) becomes $\hat s_{p,\lambda}^{g,Z}(x)=-K_{xZ}K_{ZZ}^{-\frac12}(\mathbf L+\lambda I)^{-1}K_{ZZ}^{-\frac12}h=-K_{xZ}(K_{ZX}K_{XZ}+\lambda K_{ZZ})^{-1}h$. Similar to Example 3.5, we find this is exactly the NKEF estimator obtained in Sutherland et al. (2018, Theorem 1).
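As a concrete illustration of Example C.1, the following NumPy sketch evaluates $-K_{xZ}(K_{ZX}K_{XZ}+\lambda K_{ZZ})^{-1}h$ from precomputed kernel blocks. The stacked-block layout and the convention that $h$ collects $\hat\zeta$ at the Nyström points are our assumptions about the intended dimensions, not a reference implementation.

```python
import numpy as np

def nystrom_tikhonov_score(K_xZ, K_ZX, K_ZZ, h, lam):
    """Nystrom/Tikhonov score estimate of Example C.1.

    K_xZ : (d, N*d)    kernel block between the query point x and Z
    K_ZX : (N*d, M*d)  kernel block between Z and the full sample X
    K_ZZ : (N*d, N*d)  kernel block on Z
    h    : (N*d,)      stacked values of zeta_hat at the Nystrom points Z
    lam  : Tikhonov regularization parameter
    """
    A = K_ZX @ K_ZX.T + lam * K_ZZ          # K_ZX K_XZ + lam K_ZZ  (uses K_XZ = K_ZX^T)
    return -K_xZ @ np.linalg.solve(A, h)    # s_hat(x) = -K_xZ A^{-1} h
```

Only an $Nd\times Nd$ system is solved, matching the complexity reduction discussed in Sec. C.1.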

C.2. Computational Details

Details of Example 3.6. Using the notation in Example 3.6 and Sec. 2.2, we can reformulate SSGE into matrix form as follows:
$$\begin{aligned}
\hat g_i(x)&=-\sum_{j=1}^J\left(\frac1M\sum_{n=1}^M\partial_i\hat\psi_j(x^n)\right)\psi_j(x)\\
&=-\sum_{j=1}^J\frac1M\left(\frac{\sqrt M}{\lambda_j}\sum_{n,m=1}^M\partial_ik(x^n,x^m)w_j^{(m)}\right)\left(\frac{\sqrt M}{\lambda_j}\sum_{\ell=1}^Mk(x,x^\ell)w_j^{(\ell)}\right)\\
&=-\sum_{j=1}^J\frac1{\lambda_j^2}\left(\sum_{n,m=1}^M\partial_ik(x^n,x^m)w_j^{(m)}\right)\left(\sum_{\ell=1}^Mk(x,x^\ell)w_j^{(\ell)}\right)\\
&=-\sum_{\ell=1}^M\sum_{n,m=1}^Mk(x,x^\ell)\left(\sum_{j=1}^J\frac{w_j^{(m)}w_j^{(\ell)}}{\lambda_j^2}\right)\partial_ik(x^n,x^m)\\
&=-\sum_{\ell=1}^M\sum_{m=1}^Mk(x,x^\ell)\left(\sum_{j=1}^J\frac{w_j^{(m)}w_j^{(\ell)}}{\lambda_j^2}\right)\left(\sum_{n=1}^M\partial_ik(x^n,x^m)\right)\\
&=-k(x,X)\left(\sum_{j=1}^J\frac{w_jw_j^T}{\lambda_j^2}\right)r_i,
\end{aligned}$$

where $r_{i,j}=\sum_{n=1}^M\partial_ik(x^n,x^j)$, and $w_1,\cdots,w_M$ are the unit eigenvectors of $k(X,X)$ corresponding to the eigenvalues $\lambda_1\ge\cdots\ge\lambda_M$, with $w_j^{(m)}$ the $m$-th component of $w_j$. Note that when using diagonal kernels, we have $K(x,y)=k(x,y)\otimes I_d$; then the eigenvectors of $K(X,X)$ are $\{w_i\otimes e_j:i\in[M],j\in[d]\}$ and the eigenvalue corresponding to $w_i\otimes e_j$ is $\lambda_i$, where $\{e_j\}$ is the standard basis of $\mathbb R^d$. We also note that in this case

$$h_{(m-1)d+i}=\hat\zeta(x^m)_i=\frac1M\sum_{\ell=1}^M\left(\mathrm{div}_{x^\ell}K(x^\ell,x^m)\right)_i=\frac1M\sum_{\ell=1}^M\partial_ik(x^\ell,x^m)=\frac1M\,r_{i,m}.$$
Comparing with (12), we find that SSGE is equivalent to using diagonal kernels and the spectral cut-off regularization.
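The matrix form above translates directly into code. The sketch below is a minimal NumPy version; the helpers `kernel` and `grad_kernel_x1` (returning $\partial_ik(\cdot,\cdot)$ with respect to the first argument) are assumed to be supplied by the user, and the cut-off level $J$ plays the role of the spectral cut-off parameter.

```python
import numpy as np

def ssge_gradient(x, X, kernel, grad_kernel_x1, J):
    """SSGE in matrix form: g(x) = -k(x, X) (sum_j w_j w_j^T / lam_j^2) r.

    x                   : (d,)    query point
    X                   : (M, d)  samples
    kernel(A, B)        : Gram matrix k(A, B) of shape (len(A), len(B))
    grad_kernel_x1(A,B) : array of shape (len(A), len(B), d) with entries
                          d_i k(a^n, b^m) w.r.t. the first argument a^n
    J                   : number of eigenpairs kept
    """
    K = kernel(X, X)                            # (M, M)
    lam, W = np.linalg.eigh(K)                  # ascending eigenvalues
    lam, W = lam[::-1][:J], W[:, ::-1][:, :J]   # top-J eigenpairs, lam_1 >= ... >= lam_J
    R = grad_kernel_x1(X, X).sum(axis=0)        # R[m, i] = sum_n d_i k(x^n, x^m) = r_{i,m}
    kxX = kernel(x[None, :], X)[0]              # (M,)
    proj = (W / lam**2) @ W.T                   # sum_j w_j w_j^T / lam_j^2
    return -kxX @ proj @ R                      # (d,) estimated gradient at x
```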

Details of Example 3.7. For the regularizer $g_\lambda(\sigma):=(\lambda+\sigma)^{-1}\mathbf 1_{\{\sigma>0\}}$, from Lemma C.2 we know that when $K$ is non-singular, $\hat s_{p,\lambda}^g(x)=-K_{xX}K^{-1}(\frac1MK+\lambda I)^{-1}h$. Next, we consider the minimization problem in (6), ignore the one-dimensional subspace $\mathbb R\hat\zeta$ of the solution space, and assume the solution is $K_{xX}c$ as before. We can rewrite the objective in (6) as
$$\frac1Mc^TK^2c+\lambda c^TKc+2c^Th.$$

By taking the gradient, we find that $c$ satisfies $(\frac1MK^2+\lambda K)c=-h$, so this is equivalent to using the previously mentioned regularization.
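For completeness, a small sketch of this estimator, assuming a non-singular Gram matrix $K$ as in the discussion above; the stacked-block variable layout follows Theorem 3.1 and is our assumption.

```python
import numpy as np

def example_3_7_score(K_xX, K, h, lam, M):
    """Solve (K^2 / M + lam K) c = -h and return s_hat(x) = K_xX c,
    which equals -K_xX K^{-1} (K/M + lam I)^{-1} h when K is non-singular."""
    c = np.linalg.solve(K @ K / M + lam * K, -h)
    return K_xX @ c
```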

C.3. Curl-Free Kernels

Recover the Function From Its Gradient. Since every vector field in a curl-free RKHS is the gradient of some function, it is possible to recover this function from the vector field. Specifically, suppose the curl-free kernel is defined by $K_{cf}(x,y)=-\nabla^2\psi(x-y)$ and $f\in\mathcal H_{K_{cf}}$. Assume $f$ is of the following form:

$$f=\sum_{i=1}^mK_{cf}(x^i,\cdot)c_i=-\sum_{i=1}^m\sum_{j=1}^d\nabla\left(\partial_j\psi(x^i-\cdot)\right)c_i^{(j)}=\nabla\left(-\sum_{i=1}^m\sum_{j=1}^d\partial_j\psi(x^i-\cdot)c_i^{(j)}\right),$$

where $c_i^{(j)}$ is the $j$-th component of $c_i$. Thus we have found a function whose gradient is $f$.

The Special Structure of $K_{cf}(x,y)=-\nabla^2\phi(\|x-y\|)$. As we have mentioned in Sec. 3.5, curl-free kernels have some special structures. Suppose $K_{cf}$ is a curl-free kernel defined by $-\nabla^2\phi(r)$, where $\mathbf r=x-x'$ and $r=\|\mathbf r\|$. Then
$$\partial_{r_i}\phi=\phi'\frac{r_i}{r},\qquad\partial_{r_i}\nabla\phi=\phi''\frac{r_i}{r^2}\mathbf r+\phi'\frac{e_ir-r_i\frac{\mathbf r}{r}}{r^2},$$
where $e_i$ is the $i$-th column of the identity matrix. Then the curl-free kernel is of the form
$$K_{cf}(x,y)=\left(\frac{\phi'}{r^3}-\frac{\phi''}{r^2}\right)\mathbf r\mathbf r^T-\frac{\phi'}{r}I.\qquad(19)$$

We also obtain a divergence formula for such kernels. Note that

$$\partial_{jj}^2\partial_i\phi=\phi'''\frac{r_j^2r_i}{r^3}+\phi''\frac{(r_i+r_j\delta_{ij})r^2-2r_j^2r_i}{r^4}+\phi''\frac{r_j}{r}\frac{\delta_{ij}r-\frac{r_ir_j}{r}}{r^2}+\phi'\frac{(\delta_{ij}r_j-r_i)r^3-3rr_j(\delta_{ij}r^2-r_ir_j)}{r^6},$$
where $\delta_{ij}=[i=j]$. Next, we sum out $j$ and obtain
$$\left(\mathrm{div}_xK_{cf}(x,x')\right)_i=-\Delta(\partial_i\phi)(r)=-\frac{r_i}{r}\left(\phi'''(r)+\frac{d-1}{r}\left(\phi''(r)-\frac{\phi'(r)}{r}\right)\right).\qquad(20)$$

The Special Structure of $K_{cf}(x,y)=-\nabla^2\varphi(\|x-y\|^2)$. Since many frequently used kernels depend only on $\|x-y\|^2$, we also consider the structure of curl-free kernels of this type. Suppose $K_{cf}$ is a curl-free kernel defined by $-\nabla^2\varphi(r^2)$, where $\mathbf r=x-x'$ and $r=\|\mathbf r\|$. Then, using (19) and (20), we find

$$K_{cf}(x,y)=-4\varphi''\mathbf r\mathbf r^T-2\varphi'I,\qquad(21)$$
$$\mathrm{div}_xK_{cf}(x,y)=-4\left[(d+2)\varphi''+2r^2\varphi'''\right]\mathbf r.\qquad(22)$$
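Equations (21)-(22) make curl-free kernels cheap to assemble for radial $\varphi$. The sketch below instantiates them for the Gaussian $\varphi(t)=\exp(-t/(2\sigma^2))$; this particular choice of $\varphi$ and the helper name are illustrative assumptions.

```python
import numpy as np

def curlfree_gaussian_blocks(X, Y, sigma):
    """Curl-free kernel K_cf(x, y) = -4 phi'' r r^T - 2 phi' I and div_x K_cf
    (Eqs. (21)-(22)) with phi(t) = exp(-t / (2 sigma^2)).

    Returns
    K   : (Mx, My, d, d)  blocks K_cf(x^i, y^j)
    div : (Mx, My, d)     divergences div_x K_cf(x^i, y^j)
    """
    d = X.shape[1]
    R = X[:, None, :] - Y[None, :, :]                   # r = x - y, shape (Mx, My, d)
    r2 = np.sum(R**2, axis=-1, keepdims=True)           # ||r||^2
    phi = np.exp(-r2 / (2 * sigma**2))
    dphi, d2phi, d3phi = (-phi / (2 * sigma**2),        # phi', phi'', phi'''
                          phi / (4 * sigma**4),
                          -phi / (8 * sigma**6))
    K = (-4 * d2phi[..., None] * R[..., :, None] * R[..., None, :]
         - 2 * dphi[..., None] * np.eye(d))
    div = -4 * ((d + 2) * d2phi + 2 * r2 * d3phi) * R
    return K, div
```

With $X=Y$ set to the sample, $\hat\zeta(x^m)$ can be assembled as `div.mean(axis=0)`, and the blocks can be reshaped into the $Md\times Md$ Gram matrix used by the estimators in Sec. C.4.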

C.4. Details of Different Regularization Schemes

C.4.1. TIKHONOV REGULARIZATION

Proof of Theorem 3.1. When $g_\lambda(\sigma)=(\sigma+\lambda)^{-1}$, the estimator is $\hat s_{p,\lambda}=-(\hat L_K+\lambda I)^{-1}\hat\zeta$. We need to compute an explicit formula for the inverse of $\hat L_K+\lambda I$. Note that $-(\hat L_K+\lambda I)^{-1}\hat\zeta$ is the solution of the following minimization problem:

$$\hat s_{p,\lambda}^g=\arg\min_{s\in\mathcal H_K}\frac1M\sum_{i=1}^Ms(x^i)^Ts(x^i)+2\langle s,\hat\zeta\rangle_{\mathcal H}+\lambda\|s\|_{\mathcal H}^2.$$

From the general representer theorem (Sriperumbudur et al., 2017, Theorem A.2), the minimizer lies in the space generated by $\{K_{x^i}c:i\in[M],c\in\mathbb R^d\}\cup\{\hat\zeta\}$.

We can therefore assume
$$\hat s_{p,\lambda}^g=\sum_{i=1}^MK_{x^i}c_i+a\hat\zeta.$$

Define $c:=(c_1,\cdots,c_M)$ and $h:=(\hat\zeta(x^1),\cdots,\hat\zeta(x^M))$; then the optimization objective can be written as

$$\frac1M\left(c^TK^2c+2ac^TKh+a^2h^Th\right)+2\left(a\|\hat\zeta\|_{\mathcal H}^2+h^Tc\right)+\lambda\left(c^TKc+2ac^Th+a^2\|\hat\zeta\|_{\mathcal H}^2\right).$$

Taking the derivative, we need to solve the following linear system

$$\frac1M(K^2c+aKh)+h+\lambda(Kc+ah)=0,\qquad\frac1M(ah^Th+c^TKh)+(1+\lambda a)\|\hat\zeta\|_{\mathcal H}^2+\lambda c^Th=0.$$

By some calculations, this system is equivalent to a = −1/λ and (K + MλI)c = h/λ.
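The resulting estimator is therefore $\hat s_{p,\lambda}^g(x)=K_{xX}c-\hat\zeta(x)/\lambda$ with $c=(K+M\lambda I)^{-1}h/\lambda$. A minimal NumPy sketch follows; the stacked-block layout of $K$, $h$, and $K_{xX}$ is assumed as in Theorem 3.1.

```python
import numpy as np

def tikhonov_score(K_xX, zeta_x, K, h, lam, M):
    """Tikhonov estimator: a = -1/lam and (K + M lam I) c = h / lam.

    K_xX   : (d, M*d)    kernel block between the query x and the samples X
    zeta_x : (d,)        zeta_hat evaluated at x
    K      : (M*d, M*d)  Gram matrix on X
    h      : (M*d,)      stacked values of zeta_hat at X
    """
    c = np.linalg.solve(K + M * lam * np.eye(K.shape[0]), h / lam)
    return K_xX @ c - zeta_x / lam      # K_xX c + a * zeta_hat(x), with a = -1/lam
```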

C.4.2. SPECTRAL CUT-OFF REGULARIZATION

Proof of Lemma 3.2. Let $\mathcal H_0$ be the subspace of $\mathcal H_K$ generated by $\{K_{x^m}c:c\in\mathbb R^d,m\in[M]\}$. Note that $f(x^m)^Tc=\langle K(\cdot,x^m)c,f\rangle_{\mathcal H}=0$ for any $f\in\mathcal H_0^\perp$ and $c\in\mathbb R^d$, so $\hat L_K=0$ on $\mathcal H_0^\perp$. Also note that $\hat L_Kv\in\mathcal H_0$ and $v(x^m)=\sqrt{M\sigma}\,u^{(m)}$; then

$$\hat L_Kv(x^k)=\frac1M\sum_{m=1}^MK(x^k,x^m)v(x^m)=\frac1{\sqrt M}\sum_{m=1}^MK(x^k,x^m)\sqrt\sigma\,u^{(m)}=\sigma v(x^k),$$
and we conclude that $\hat L_Kv=\sigma v$. The following equation shows that $v$ is normalized:

$$\|v\|_{\mathcal H}^2=\frac1{\sqrt{M\sigma}}\sum_{m=1}^M\left\langle K(\cdot,x^m)u^{(m)},v\right\rangle_{\mathcal H}=\frac1{\sqrt{M\sigma}}\sum_{m=1}^M\left\langle u^{(m)},v(x^m)\right\rangle_{\mathbb R^d}=\sum_{m=1}^M(u^{(m)})^Tu^{(m)}=1.$$

Theorem 3.3 is a corollary of the following lemma, which provides a general form for regularizers $g_\lambda$ with $g_\lambda(0)=0$.

Lemma C.2. Let $g_\lambda:[0,\kappa^2]\to\mathbb R$ be a regularizer such that $g_\lambda(0)=0$. Let $(\sigma_j,u_j)_{j\ge1}$ be the non-zero eigenvalue and eigenvector pairs satisfying $\frac1MKu_j=\sigma_ju_j$. Then we have
$$g_\lambda(\hat L_K)\hat\zeta=K_{xX}\left(\sum_i\frac{g_\lambda(\sigma_i)}{M\sigma_i}u_iu_i^T\right)h,$$
where $K_{xX}$ and $h$ are defined as in Theorem 3.1.

Proof. Let $\{(\mu_i,v_i)\}$ be the pairs of non-zero eigenvalues and eigenfunctions of $\hat L_K:\mathcal H\to\mathcal H$; then by Lemma 3.2 we have $\sigma_i=\mu_i$. Note that
$$\hat L_K=\sum_i\mu_i\langle v_i,\cdot\rangle_{\mathcal H}v_i\qquad\text{and}\qquad g_\lambda(\hat L_K)=\sum_ig_\lambda(\mu_i)\langle v_i,\cdot\rangle_{\mathcal H}v_i.$$

From Lemma 3.2, we have

$$\begin{aligned}
g_\lambda(\hat L_K)\hat\zeta&=\sum_ig_\lambda(\sigma_i)\langle v_i,\hat\zeta\rangle_{\mathcal H}v_i\\
&=\sum_ig_\lambda(\sigma_i)\left\langle\frac1{\sqrt{M\sigma_i}}\sum_{j=1}^MK_{x^j}u_i^{(j)},\hat\zeta\right\rangle_{\mathcal H}\left(\frac1{\sqrt{M\sigma_i}}\sum_{k=1}^MK_{x^k}u_i^{(k)}\right)\\
&=\frac1M\sum_i\sum_{j,k=1}^Mg_\lambda(\sigma_i)\sigma_i^{-1}\left\langle K_{x^j}u_i^{(j)},\hat\zeta\right\rangle_{\mathcal H}K_{x^k}u_i^{(k)}\\
&=\frac1M\sum_i\sum_{j,k=1}^Mg_\lambda(\sigma_i)\sigma_i^{-1}\,\hat\zeta(x^j)^Tu_i^{(j)}\,K_{x^k}u_i^{(k)}\\
&=K_{xX}\left(\sum_i\frac{g_\lambda(\sigma_i)}{M\sigma_i}u_iu_i^T\right)h.
\end{aligned}$$
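Since the framework's estimator is $\hat s_{p,\lambda}^g=-g_\lambda(\hat L_K)\hat\zeta$, Lemma C.2 gives $\hat s_{p,\lambda}^g(x)=-K_{xX}\big(\sum_i\frac{g_\lambda(\sigma_i)}{M\sigma_i}u_iu_i^T\big)h$ for any regularizer with $g_\lambda(0)=0$. A minimal NumPy sketch, with the regularizer passed as a callable; the cut-off example in the comment is an assumption consistent with the spectral cut-off scheme.

```python
import numpy as np

def spectral_score(K_xX, K, h, M, g_lam):
    """s_hat(x) = -K_xX B h with B = sum_i g_lam(sigma_i) / (M sigma_i) u_i u_i^T,
    where (sigma_i, u_i) are the nonzero eigenpairs of K / M.
    Example (spectral cut-off): g_lam = lambda s: (s >= 0.1) / s
    """
    sigma, U = np.linalg.eigh(K / M)
    keep = sigma > 1e-12                      # keep only (numerically) nonzero eigenvalues
    sigma, U = sigma[keep], U[:, keep]
    B = (U * (g_lam(sigma) / (M * sigma))) @ U.T
    return -K_xX @ (B @ h)
```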

C.4.3. ITERATIVE REGULARIZATION

Theorem C.3 (Landweber iteration). Let $\hat s_{p,\lambda}^g$ be defined as in (8), and $g_\lambda(\sigma)=\eta\sum_{i=0}^{t-1}(1-\eta\sigma)^i$, where $t:=\lfloor\lambda^{-1}\rfloor$. Then we have
$$\hat s_{p,\lambda}^g(x)=-t\eta\hat\zeta(x)+K_{xX}c_t,$$
where $c_0=0$ and $c_{t+1}=(I-\eta K/M)c_t+t\eta^2h/M$, and $K_{xX}$ and $h$ are defined as in Theorem 3.1.

Proof. We note that the iteration process is

$$\hat s_p^{(1)}=-\eta\hat\zeta,\qquad\hat s_p^{(t)}=-\eta\hat\zeta+(I-\eta\hat L_K)\hat s_p^{(t-1)}=\hat s_p^{(t-1)}+\eta\left(-\hat\zeta-\hat L_K\hat s_p^{(t-1)}\right),$$

where we define $\hat s_p^{(t)}:=\hat s_{p,1/t}$. We can assume

$$\hat s_p^{(t)}=a_t\hat\zeta+K_{xX}c_t.$$

Then, by induction,
$$\hat s_p^{(t)}=-\eta\hat\zeta+(I-\eta\hat L_K)\left(a_{t-1}\hat\zeta+K_{xX}c_{t-1}\right)=(a_{t-1}-\eta)\hat\zeta+K_{xX}\left(c_{t-1}-\frac\eta M\left(a_{t-1}h+Kc_{t-1}\right)\right).$$

Thus, we have $a_t=-t\eta$ and $c_t=(I-\eta K/M)c_{t-1}+(t-1)\eta^2h/M$, with $c_1=0$.
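A minimal NumPy sketch of the Landweber estimator, running $t=\lfloor\lambda^{-1}\rfloor$ steps of the recursion derived above; the choice of step size $\eta$ (small enough that $\eta\|K\|/M\le1$) is an assumption.

```python
import numpy as np

def landweber_score(K_xX, zeta_x, K, h, eta, lam, M):
    """Landweber iteration: c_{t+1} = (I - eta K / M) c_t + t eta^2 h / M, c_0 = 0,
    and s_hat(x) = -t eta zeta_hat(x) + K_xX c_t."""
    t = int(np.floor(1.0 / lam))
    c = np.zeros_like(h)
    for s in range(t):                         # after the loop, c equals c_t
        c = c - eta * (K @ c) / M + s * eta**2 * h / M
    return -t * eta * zeta_x + K_xX @ c
```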

Before introducing the ν-method, we recall that iterative regularization can be represented by a family of polynomials $g_\lambda(\sigma)=\mathrm{poly}(\sigma)$, where $g_\lambda$ converges to the function $1/\sigma$ as $\lambda\to0$. For example, for the Landweber iteration we have

$$g_\lambda(\sigma)=\eta\sum_{i=0}^{t-1}(1-\eta\sigma)^i=\frac{1-(1-\eta\sigma)^t}{\sigma}.$$
We can verify that the identification of $\lambda$ with $t^{-1}$ satisfies Definition 4.1 of a regularization scheme. To see the qualification, note that the maximum of $|1-\sigma g_\lambda(\sigma)|\sigma^r=\sigma^r(1-\eta\sigma)^t$ over $[0,\eta^{-1}]$ is attained at $\sigma=r/(r\eta+t)$, and hence
$$\sup_{0\le\sigma\le\eta^{-1}}|1-\sigma g_\lambda(\sigma)|\sigma^r\le\frac{r^rt^t}{(r\eta+t)^{r+t}}\le\left(\frac rt\right)^r\le\max(r^r,1)\,\lambda^r.$$

Thus, we see that the qualification is $\infty$.

Example C.4 (ν-method). The ν-method (Engl et al., 1996) is an accelerated version of the Landweber iteration. The idea behind it is to find better polynomials $p_t(\sigma)$ of degree $t$ to approximate the function $1/\sigma$. These polynomials satisfy $\sup_{0\le\sigma\le1}|1-\sigma p_t(\sigma)|\sigma^\nu\le c_\nu t^{-2\nu}$. Compared with the definition of the qualification in Definition 4.1, we can identify $\lambda$ with $t^{-2}$. Thus, for the same regularization parameter, the ν-method requires only about $\lambda^{-1/2}$ iterations while the Landweber iteration requires about $\lambda^{-1}$ iterations. For more details about the construction of these polynomials, we refer the reader to Engl et al. (1996, Appendix A.1 and Section 6.3).

Below we give the algorithm of the ν-method, where $t=\lfloor\lambda^{-1/2}\rfloor$ and $\hat s_{p,\lambda}:=\hat s_p^{(t)}$.

$$\hat s_p^{(0)}=0,\qquad\hat s_p^{(1)}=-\omega_1\hat\zeta,\qquad\hat s_p^{(t)}=\hat s_p^{(t-1)}+u_t\left(\hat s_p^{(t-1)}-\hat s_p^{(t-2)}\right)+\omega_t\left(-\hat\zeta-\hat L_K\hat s_p^{(t-1)}\right),$$
where
$$u_t=\frac{(t-1)(2t-3)(2t+2\nu-1)}{(t+2\nu-1)(2t+4\nu-1)(2t+2\nu-3)},\qquad\omega_t=\frac{4(2t+2\nu-1)(t+\nu-1)}{(t+2\nu-1)(2t+4\nu-1)}.$$

Similarly, we can assume
$$\hat s_p^{(t)}=a_t\hat\zeta+K_{xX}c_t.$$

Then, by induction,

$$\begin{aligned}
\hat s_p^{(t)}&=\left(1+u_t-\omega_t\hat L_K\right)\hat s_p^{(t-1)}-u_t\hat s_p^{(t-2)}-\omega_t\hat\zeta\\
&=\left(1+u_t-\omega_t\hat L_K\right)\left(a_{t-1}\hat\zeta+K_{xX}c_{t-1}\right)-u_t\left(a_{t-2}\hat\zeta+K_{xX}c_{t-2}\right)-\omega_t\hat\zeta\\
&=\left((1+u_t)a_{t-1}-u_ta_{t-2}-\omega_t\right)\hat\zeta+K_{xX}\left((1+u_t)c_{t-1}-\frac{\omega_t}{M}\left(a_{t-1}h+Kc_{t-1}\right)-u_tc_{t-2}\right).
\end{aligned}$$

Thus, we obtain the iteration formula for at and ct as follows:

$$a_t:=(1+u_t)a_{t-1}-u_ta_{t-2}-\omega_t,\qquad c_t:=(1+u_t)c_{t-1}-\frac{\omega_t}{M}\left(a_{t-1}h+Kc_{t-1}\right)-u_tc_{t-2},$$
with $c_0=c_1=0$, $a_0=0$, and $a_1=-\omega_1$.
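A minimal NumPy sketch of the ν-method recursion for $(a_t,c_t)$; the starting value $\omega_1$ is taken from the general formula at $t=1$ (i.e., $(4\nu+2)/(4\nu+1)$, as in Engl et al. (1996)), which is an assumption since the text leaves it implicit.

```python
import numpy as np

def nu_method_score(K_xX, zeta_x, K, h, lam, M, nu=1.0):
    """nu-method: run t = floor(lam^{-1/2}) steps and return a_t zeta_hat(x) + K_xX c_t."""
    T = max(int(np.floor(lam ** -0.5)), 1)
    omega1 = (4 * nu + 2) / (4 * nu + 1)                  # general omega_t formula at t = 1
    a_prev, a_curr = 0.0, -omega1                         # a_0, a_1
    c_prev, c_curr = np.zeros_like(h), np.zeros_like(h)   # c_0, c_1
    for t in range(2, T + 1):
        u = ((t - 1) * (2 * t - 3) * (2 * t + 2 * nu - 1)
             / ((t + 2 * nu - 1) * (2 * t + 4 * nu - 1) * (2 * t + 2 * nu - 3)))
        w = (4 * (2 * t + 2 * nu - 1) * (t + nu - 1)
             / ((t + 2 * nu - 1) * (2 * t + 4 * nu - 1)))
        a_next = (1 + u) * a_curr - u * a_prev - w
        c_next = (1 + u) * c_curr - (w / M) * (a_curr * h + K @ c_curr) - u * c_prev
        a_prev, a_curr = a_curr, a_next
        c_prev, c_curr = c_curr, c_next
    return a_curr * zeta_x + K_xX @ c_curr
```

Compared with `landweber_score` above, only about $\lambda^{-1/2}$ matrix-vector products with $K$ are needed.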

D. Technical Results

Lemma D.1. Suppose Assumption B.5 holds. Then $L_K,\hat L_K:\mathcal H_K\to\mathcal H_K$ are positive, self-adjoint, trace-class operators. Moreover, $\mathrm{tr}\,L_K\le\kappa^2$ and $\mathrm{tr}\,\hat L_K\le\kappa^2$.

Proof. The result follows from a simple calculation. It is easy to see that $L_K$ and $\hat L_K$ are positive and self-adjoint. We prove that they are trace class. Let $\{\varphi_i\}$ be an orthonormal basis of $\mathcal H_K$ and $\{e_k\}$ be the standard basis of $\mathbb R^d$; then

$$\begin{aligned}
\mathrm{tr}\,L_K&=\sum_i\langle L_K\varphi_i,\varphi_i\rangle_{\mathcal H}=\int_{\mathcal X}\sum_i\langle K_x\varphi_i(x),\varphi_i\rangle_{\mathcal H}\,d\rho=\int_{\mathcal X}\sum_i\sum_{k=1}^d\left\langle\langle K_xe_k,\varphi_i\rangle_{\mathcal H}K_xe_k,\varphi_i\right\rangle_{\mathcal H}\,d\rho\\
&=\sum_{k=1}^d\int_{\mathcal X}\sum_i|\langle K_xe_k,\varphi_i\rangle_{\mathcal H}|^2\,d\rho=\sum_{k=1}^d\int_{\mathcal X}\|K_xe_k\|_{\mathcal H}^2\,d\rho=\int_{\mathcal X}\mathrm{tr}\,K(x,x)\,d\rho\le\kappa^2.
\end{aligned}$$

Similarly, we have $\mathrm{tr}\,\hat L_K\le\kappa^2$.

We need the following concentration inequality in Hilbert spaces, used in Bauer et al. (2007).

Lemma D.2 (Bauer et al. (2007), Proposition 23). Let $\xi$ be a random variable with values in a real Hilbert space $\mathcal H$. Assume there are two constants $\sigma,H$ such that
$$\mathbb E\left[\|\xi-\mathbb E\xi\|_{\mathcal H}^m\right]\le\frac12m!\,\sigma^2H^{m-2},\qquad\forall m\ge2.$$

Then, for all $n\in\mathbb N$ and $0<\delta<1$, the following inequality holds with probability at least $1-\delta$:
$$\|\hat\xi-\mathbb E\xi\|_{\mathcal H}\le2\left(\frac Hn+\frac\sigma{\sqrt n}\right)\log\frac2\delta,$$

where $\hat\xi=\frac1n\sum_{i=1}^n\xi_i$ and $\{\xi_i\}$ are independent copies of $\xi$.

Lemma D.3. Under Assumption B.4, for all $M\in\mathbb N$ and $0<\delta<1$, the following inequality holds with probability at least $1-\delta$:
$$\|\hat\zeta-\zeta\|_{\mathcal H}\le2\left(\frac KM+\frac\Sigma{\sqrt M}\right)\log\frac2\delta,\qquad(23)$$
where $\hat\zeta=\frac1M\sum_{m=1}^M\mathrm{div}_{x^m}K_{x^m}^T$ and $\{x^m\}$ is a set of i.i.d. samples from $\rho$.

Proof. Define an $\mathcal H_K$-valued random variable $\xi_x:=\mathrm{div}_xK_x^T$. It is easy to see that $\mathbb E_{x\sim\nu}[\xi_x]=-L_Ks_p=:\xi$. From Assumption B.4, we have for $m\ge2$,
$$\mathbb E_\nu\left[\|\xi_x-\xi\|_{\mathcal H}^m\right]\le m!\,K^m\,\mathbb E_\nu\left[\exp\left(\frac{\|\xi_x-\xi\|_{\mathcal H}}K\right)-\frac{\|\xi_x-\xi\|_{\mathcal H}}K-1\right]\le\frac12m!\,\Sigma^2K^{m-2}.$$

Note that $\hat\zeta=\frac1M\sum_{m=1}^M\xi_{x^m}$ and $\mathbb E_\nu\hat\zeta=\xi$. Then (23) follows from Lemma D.2.

Lemma D.4. Under Assumption B.5, for all $M\in\mathbb N$ and $0<\delta<1$, the following inequality holds with probability at least $1-\delta$:
$$\|\hat L_K-L_K\|_{HS}\le\frac{2\sqrt2\kappa^2}{\sqrt M}\log\frac2\delta.\qquad(24)$$

Proof. This is a direct consequence of Vito et al.(2005, Lemma 8) and Lemma D.1.

The following useful lemma is from De Vito et al. (2014, Lemma 7) and Sriperumbudur et al. (2017, Lemma 15).

Lemma D.5. Suppose $S$ and $T$ are two self-adjoint Hilbert-Schmidt operators on a separable Hilbert space $\mathcal H$ with spectra contained in the interval $[a,b]$. Given a Lipschitz function $r:[a,b]\to\mathbb R$ with Lipschitz constant $L_r$, we have

$$\|r(S)-r(T)\|_{HS}\le L_r\|S-T\|_{HS}.$$

E. Samples

Table 5. WAE samples on MNIST. (Image grid; rows: Stein, SSGE, SSM, NKEF, ν-method, KEF-CG; columns: d = 8, 32, 64, 128.)

Table 6. WAE samples on CelebA. (Image grid; rows: Stein, SSGE, SSM, NKEF, ν-method, KEF-CG; columns: d = 8, 32, 64, 128.)