Journal of Machine Learning Research 17 (2016) 1-28. Submitted 3/14; Revised 5/16; Published 9/16.
arXiv:1403.7304v3 [stat.ML] 25 Oct 2016

Characteristic Kernels and Infinitely Divisible Distributions

Yu Nishiyama ([email protected])
The University of Electro-Communications
1-5-1 Chofugaoka, Chofu, Tokyo 182-8585, Japan

Kenji Fukumizu ([email protected])
The Institute of Statistical Mathematics
10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan

Editor: Ingo Steinwart

(c) 2016 Yu Nishiyama and Kenji Fukumizu.

Abstract

We connect shift-invariant characteristic kernels to infinitely divisible distributions on $\mathbb{R}^d$. Characteristic kernels play an important role in machine learning applications with their kernel means to distinguish any two probability measures. The contribution of this paper is twofold. First, we show, using the Lévy–Khintchine formula, that any shift-invariant kernel given by a bounded, continuous, and symmetric probability density function (pdf) of an infinitely divisible distribution on $\mathbb{R}^d$ is characteristic. We mention some closure properties of such characteristic kernels under addition, pointwise product, and convolution. Second, in developing various kernel mean algorithms, it is fundamental to compute the following values: (i) kernel mean values $m_P(x)$, $x \in \mathcal{X}$, and (ii) kernel mean RKHS inner products $\langle m_P, m_Q \rangle_{\mathcal{H}}$, for probability measures $P, Q$. If $P$, $Q$, and the kernel $k$ are Gaussian, then the computation of (i) and (ii) results in Gaussian pdfs that are tractable. We generalize this Gaussian combination to more general cases in the class of infinitely divisible distributions. We then introduce a conjugate kernel and a convolution trick, so that (i) and (ii) above have the same pdf form, expecting tractable computation at least in some cases. As specific instances, we explore α-stable distributions and a rich class of generalized hyperbolic distributions, where the Laplace, Cauchy, and Student's t distributions are included.

Keywords: Characteristic Kernel, Kernel Mean, Infinitely Divisible Distribution, Conjugate Kernel, Convolution Trick

1. Introduction

Let $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$ be a measurable space and $\mathcal{M}_1(\mathcal{X})$ be the set of probability measures on it. Let $\mathcal{H}$ be the real-valued reproducing kernel Hilbert space (RKHS) associated with a bounded and measurable positive-definite (p.d.) kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. In machine learning, kernel methods provide a technique for developing nonlinear algorithms by mapping data $X_1, \dots, X_n$ in $\mathcal{X}$ to higher- or infinite-dimensional RKHS functions $k(\cdot, X_1), \dots, k(\cdot, X_n)$ in $\mathcal{H}$ (Schölkopf and Smola, 2002; Steinwart and Christmann, 2008).

Recently, an RKHS representation of a probability measure $P \in \mathcal{M}_1(\mathcal{X})$, called the kernel mean, $m_P := \mathbb{E}_{X \sim P}[k(\cdot, X)] \in \mathcal{H}$ (Smola et al., 2007; Fukumizu et al., 2013), or equivalently,

$$ m_P(x) = \int k(x, y)\, dP(y), \qquad x \in \mathcal{X}, \tag{1} $$

has been used to handle probability measures in RKHSs. The kernel mean enables us to introduce a similarity and a distance between two probability measures $P, Q \in \mathcal{M}_1(\mathcal{X})$ via the RKHS inner product $\langle m_P, m_Q \rangle_{\mathcal{H}}$ and the norm $\|m_P - m_Q\|_{\mathcal{H}}$, respectively. Using these quantities, different authors have proposed many algorithms, including density estimation (Smola et al., 2007; Song et al., 2008; McCalman et al., 2013), hypothesis testing (Gretton et al., 2008, 2012; Fukumizu et al., 2008), kernel Bayesian inference (Song et al., 2009, 2010, 2011, 2013; Fukumizu et al., 2013; Kanagawa et al., 2016; Nishiyama et al., 2016), classification (Muandet et al., 2012), dimension reduction (Fukumizu and Leng, 2012), and reinforcement learning (Grünewälder et al., 2012; Nishiyama et al., 2012; Rawlik et al., 2013; Boots et al., 2013).
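For concreteness, consider the empirical measure $P = \frac{1}{n}\sum_{i=1}^{n} \delta_{X_i}$ of a sample, whose kernel mean under (1) is $m_P(x) = \frac{1}{n}\sum_{i=1}^{n} k(x, X_i)$; the inner product and the norm above then reduce to averages of kernel evaluations. The following minimal NumPy sketch (added here for illustration; the function names are not from the paper) computes these quantities for a Gaussian kernel. The squared norm it returns is the biased (V-statistic) estimate of $\|m_P - m_Q\|_{\mathcal{H}}^2$, the quantity underlying the kernel hypothesis tests cited above.

```python
import numpy as np

def gaussian_kernel(X, Y, tau=1.0):
    """Shift-invariant Gaussian p.d. kernel k(x, y) = exp(-||x - y||^2 / (2 tau^2)),
    evaluated for all pairs of rows of X (shape (m, d)) and Y (shape (n, d))."""
    sq_dists = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * tau ** 2))

def empirical_kernel_mean(x, X, tau=1.0):
    """m_P(x) = (1/n) sum_i k(x, X_i) for the empirical measure P of the sample X."""
    return gaussian_kernel(np.atleast_2d(x), X, tau).mean(axis=1)

def mmd_squared(X, Y, tau=1.0):
    """||m_P - m_Q||_H^2 = <m_P, m_P> - 2 <m_P, m_Q> + <m_Q, m_Q> for the
    empirical kernel means of the samples X and Y."""
    return (gaussian_kernel(X, X, tau).mean()
            - 2.0 * gaussian_kernel(X, Y, tau).mean()
            + gaussian_kernel(Y, Y, tau).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(300, 2))   # sample from P
Y = rng.normal(0.5, 1.0, size=(300, 2))   # sample from Q
print(empirical_kernel_mean(np.zeros(2), X))
print(mmd_squared(X, Y))
```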
In these applications, the characteristic property of a p.d. kernel $k$ is important: a p.d. kernel is said to be characteristic if any two probability measures $P, Q \in \mathcal{M}_1(\mathcal{X})$ can be distinguished by their kernel means $m_P, m_Q \in \mathcal{H}$ (Fukumizu et al., 2004; Sriperumbudur et al., 2010, 2011). For a continuous, bounded, and shift-invariant p.d. kernel on $\mathbb{R}^d$ with $k(x, y) = \kappa(x - y)$, a necessary and sufficient condition for the kernel to be characteristic is known via the Bochner theorem (Sriperumbudur et al., 2010, Theorem 9). As the first contribution of this paper, we show, using the Lévy–Khintchine formula (Sato, 1999; Steutel, 2004; Applebaum, 2009), that if $\kappa$ is a continuous, bounded, and symmetric pdf of an infinitely divisible distribution on $\mathbb{R}^d$, then $k$ is a characteristic p.d. kernel. We call such kernels convolutionally infinitely divisible (CID) kernels. Examples of CID kernels are given in Example 3.4. In addition, we note some closure properties of CID kernels with respect to addition, pointwise product, and convolution.

To describe the second contribution, we briefly explain what is essentially computed in kernel mean algorithms. In general kernel methods, the following computations are fundamental:

(i) RKHS function values: $f(x)$ for $f \in \mathcal{H}$, $x \in \mathcal{X}$;
(ii) RKHS inner products: $\langle f, g \rangle_{\mathcal{H}}$ for $f, g \in \mathcal{H}$.

If $f \in \mathcal{H}$ is represented by $f := \sum_{i=1}^{n} w_i k(\cdot, X_i)$ with $w \in \mathbb{R}^n$, then the function value (i) $f(x) = \sum_{i=1}^{n} w_i k(x, X_i)$ reduces to evaluations of the kernel $k(x, y)$. Similarly, if two RKHS functions $f, g \in \mathcal{H}$ are represented by $f := \sum_{i=1}^{n} w_i k(\cdot, X_i)$ and $g := \sum_{j=1}^{l} \tilde{w}_j k(\cdot, \tilde{X}_j)$, respectively, then the inner product (ii) $\langle f, g \rangle_{\mathcal{H}} = \sum_{i=1}^{n} \sum_{j=1}^{l} w_i \tilde{w}_j k(X_i, \tilde{X}_j)$ also reduces to evaluations of the kernel $k(x, y)$; this is the so-called kernel trick $\langle k(\cdot, x), k(\cdot, y) \rangle_{\mathcal{H}} = k(x, y)$.

We consider a more general case in which $f, g \in \mathcal{H}$ are represented by $f := \sum_{i=1}^{n} w_i m_{P_i}$ and $g := \sum_{j=1}^{l} \tilde{w}_j m_{Q_j}$, respectively, where $\{m_{P_i}\}, \{m_{Q_j}\} \subset \mathcal{H}$ are kernel means of probability measures $\{P_i\}, \{Q_j\} \subset \mathcal{M}_1(\mathcal{X})$. Kernel algorithms involving kernel means use this type of RKHS function explicitly or implicitly. If $\{P_i\}$ and $\{Q_j\}$ are delta measures $\{\delta_{X_i}\}$ and $\{\delta_{\tilde{X}_j}\}$,[1] then these functions specialize to the kernel trick case above, since $m_{\delta_x} = k(\cdot, x)$. The computation of (i) $f(x) = \sum_{i=1}^{n} w_i m_{P_i}(x)$ and (ii) $\langle f, g \rangle_{\mathcal{H}} = \sum_{i=1}^{n} \sum_{j=1}^{l} w_i \tilde{w}_j \langle m_{P_i}, m_{Q_j} \rangle_{\mathcal{H}}$ requires the following kernel mean evaluations:[2]

(iii) kernel mean values: $m_P(x)$ for $P \in \mathcal{M}_1(\mathcal{X})$, $x \in \mathcal{X}$;
(iv) kernel mean inner products: $\langle m_P, m_Q \rangle_{\mathcal{H}}$ for $P, Q \in \mathcal{M}_1(\mathcal{X})$.

Note that the kernel mean value (1) and the kernel mean inner product $\langle m_P, m_Q \rangle_{\mathcal{H}} = \int\!\!\int k(x, y)\, dP(x)\, dQ(y)$ involve integrals, so their exact computation is not tractable in general.

[1] A probability measure $\delta_x(\cdot)$, $x \in \mathcal{X}$, is a delta measure: $\delta_x(B) = 1$ if $x \in B$, and $\delta_x(B) = 0$ otherwise, for $B \in \mathcal{B}(\mathcal{X})$.
[2] If the kernel means $m_P$ and $m_Q$ are themselves both expressed as weighted sums, $m_P := \sum_{i=1}^{n_P} \eta_i k(\cdot, \dot{X}_i)$ and $m_Q := \sum_{i=1}^{n_Q} \tilde{\eta}_i k(\cdot, \ddot{X}_i)$ with $\{\dot{X}_i\}, \{\ddot{X}_i\} \subset \mathcal{X}$, then the computation also reduces to the kernel trick case above.
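As a minimal illustration of this reduction (added here; the kernel choice and function names are ours), the sketch below takes each $P_i$ and $Q_j$ to be a delta measure, so that $m_{P_i} = k(\cdot, X_i)$ and the evaluations (iii) and (iv) are plain kernel evaluations. For general $P, Q$ the two helper functions would instead have to return $m_P(x)$ and $\langle m_P, m_Q \rangle_{\mathcal{H}}$, which is the computation addressed in the remainder of the paper.

```python
import numpy as np

def k(x, y, tau=1.0):
    """Gaussian p.d. kernel k(x, y) = exp(-||x - y||^2 / (2 tau^2))."""
    x, y = np.atleast_1d(x), np.atleast_1d(y)
    return float(np.exp(-np.sum((x - y) ** 2) / (2.0 * tau ** 2)))

# For delta measures P_i = delta_{X_i}, the kernel mean is m_{P_i} = k(., X_i),
# so items (iii) and (iv) are plain kernel evaluations (the kernel trick).
def kernel_mean_value(X_i, x):
    return k(x, X_i)                  # (iii): m_{delta_{X_i}}(x) = k(x, X_i)

def kernel_mean_inner(X_i, X_j):
    return k(X_i, X_j)                # (iv): <m_{delta_{X_i}}, m_{delta_{X_j}}>_H = k(X_i, X_j)

def f_value(w, Xs, x):
    """(i) f(x) = sum_i w_i m_{P_i}(x)."""
    return sum(wi * kernel_mean_value(Xi, x) for wi, Xi in zip(w, Xs))

def f_g_inner(w, Xs, w_tilde, Xs_tilde):
    """(ii) <f, g>_H = sum_i sum_j w_i w~_j <m_{P_i}, m_{Q_j}>_H."""
    return sum(wi * wj * kernel_mean_inner(Xi, Xj)
               for wi, Xi in zip(w, Xs) for wj, Xj in zip(w_tilde, Xs_tilde))

w, Xs = [0.5, 0.5], [np.zeros(2), np.ones(2)]
w_t, Xs_t = [1.0], [np.full(2, 0.5)]
print(f_value(w, Xs, np.zeros(2)), f_g_inner(w, Xs, w_t, Xs_t))
```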
The second contribution of this paper is to provide classes of p.d. kernels and parametric models $P, Q \in \mathcal{P}_\Theta := \{P_\theta \mid \theta \in \Theta\}$ for which the computations (iii) and (iv) reduce to a kernel evaluation, so that tractable computation can be expected. For a shift-invariant kernel $k(x, y) = \kappa(x - y)$, $x, y \in \mathbb{R}^d$, on $\mathbb{R}^d$, the computation of (iii) and (iv) reduces, as shown in Lemma 2.5, to the following convolutions:

(iii)' kernel mean values: $m_P(x) = (\kappa * P)(x)$;
(iv)' kernel mean inner products: $\langle m_P, m_Q \rangle_{\mathcal{H}} = (\kappa * \tilde{P} * Q)(0) = (\kappa * P * \tilde{Q})(0)$,

where $\tilde{P}$ and $\tilde{Q}$ are the duals of $P$ and $Q$, respectively (that is, $\tilde{P}(A) = P(-A)$). This convolution representation motivates us to explore a set $\mathcal{P}_\Theta$ of parametric distributions that is closed under convolution, namely, a convolution semigroup $(\mathcal{P}_\Theta, *) \subset \mathcal{M}_1(\mathbb{R}^d)$, with $\kappa$ a density function in $\mathcal{P}_\Theta$.

To illustrate the basic idea, let us consider the parametric class $\mathcal{P}_\Theta$ of Gaussian distributions, which is closed under convolution, together with a Gaussian kernel. For simplicity, we consider the case of scalar variance matrices $\sigma^2 I_d$. Let $N_d(\mu, \sigma^2 I_d)$ and $f_d(x \mid \mu, \sigma^2 I_d)$ denote the $d$-dimensional Gaussian distribution with mean $\mu$ and variance-covariance matrix $\sigma^2 I_d$ and its pdf, respectively. If $P$ and $Q$ are Gaussian distributions $N_d(\mu_P, \sigma_P^2 I_d)$ and $N_d(\mu_Q, \sigma_Q^2 I_d)$, respectively, and $k$ is given by the pdf $k(x, y) = f_d(x - y \mid 0, \tau^2 I_d)$, it is easy to see that $m_P(x) = f_d(x \mid \mu_P, (\sigma_P^2 + \tau^2) I_d)$ and $\langle m_P, m_Q \rangle_{\mathcal{H}} = f_d(\mu_P \mid \mu_Q, (\sigma_P^2 + \sigma_Q^2 + \tau^2) I_d)$. The kernel mean value and inner product thus reduce to evaluating Gaussian pdfs whose parameters are updated by a simple rule. This type of computation appears in various applications. To list a few, Muandet et al. (2012) proposed support measure classification by considering kernels $k(P, Q)$ between two input probability measures $P$ and $Q$, including Gaussian models; Song et al. (2008) and McCalman et al. (2013) considered approximating a (target) probability measure $P$ by a Gaussian mixture $P_\theta$ via the optimization problem $\hat{\theta} = \operatorname{argmin}_\theta \|m_P - m_{P_\theta}\|_{\mathcal{H}}^2$.
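To make this parameter-update rule concrete, the following NumPy sketch (added here; the function names are illustrative, not from the paper) evaluates (iii) and (iv) for Gaussian $P$, $Q$ and the Gaussian-pdf kernel $k(x, y) = f_d(x - y \mid 0, \tau^2 I_d)$ using the closed forms above, and checks the inner product against a Monte Carlo estimate of $\int\!\!\int k(x, y)\, dP(x)\, dQ(y)$.

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """Isotropic d-dimensional Gaussian pdf f_d(x | mu, var * I_d)."""
    x, mu = np.atleast_1d(x), np.atleast_1d(mu)
    d = x.size
    return np.exp(-np.sum((x - mu) ** 2) / (2.0 * var)) / (2.0 * np.pi * var) ** (d / 2)

def kernel_mean_value(x, mu_P, var_P, tau2):
    """(iii) m_P(x) = f_d(x | mu_P, (sigma_P^2 + tau^2) I_d) for P = N_d(mu_P, sigma_P^2 I_d)
    and the Gaussian-pdf kernel k(x, y) = f_d(x - y | 0, tau^2 I_d)."""
    return gauss_pdf(x, mu_P, var_P + tau2)

def kernel_mean_inner(mu_P, var_P, mu_Q, var_Q, tau2):
    """(iv) <m_P, m_Q>_H = f_d(mu_P | mu_Q, (sigma_P^2 + sigma_Q^2 + tau^2) I_d)."""
    return gauss_pdf(mu_P, mu_Q, var_P + var_Q + tau2)

# Monte Carlo check of <m_P, m_Q>_H = E_{X~P, Y~Q}[k(X, Y)] against the closed form.
rng = np.random.default_rng(0)
d, tau2 = 2, 0.5
mu_P, var_P = np.zeros(d), 1.0
mu_Q, var_Q = np.ones(d), 2.0
X = rng.normal(mu_P, np.sqrt(var_P), size=(200000, d))
Y = rng.normal(mu_Q, np.sqrt(var_Q), size=(200000, d))
mc = np.mean(np.exp(-np.sum((X - Y) ** 2, axis=1) / (2.0 * tau2))
             / (2.0 * np.pi * tau2) ** (d / 2))
print(kernel_mean_inner(mu_P, var_P, mu_Q, var_Q, tau2), mc)
```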