Hypothesis Testing Using Pairwise Distances and Associated Kernels

Dino Sejdinovic† [email protected]
Arthur Gretton†,‡,∗ [email protected]
Bharath Sriperumbudur†,∗ [email protected]
Kenji Fukumizu§ [email protected]

†Gatsby Computational Neuroscience Unit, CSML, University College London
‡Max Planck Institute for Intelligent Systems, Tübingen
§The Institute of Statistical Mathematics, Tokyo

∗ These authors contributed equally. Appendix available at arxiv.org/abs/1205.0411.
Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

Abstract

We provide a unifying framework linking two classes of statistics used in two-sample and independence testing: on the one hand, the energy distances and distance covariances from the statistics literature; on the other, distances between embeddings of distributions to reproducing kernel Hilbert spaces (RKHS), as established in machine learning. The equivalence holds when energy distances are computed with semimetrics of negative type, in which case a kernel may be defined such that the RKHS distance between distributions corresponds exactly to the energy distance. We determine the class of probability distributions for which kernels induced by semimetrics are characteristic (that is, for which embeddings of the distributions to an RKHS are injective). Finally, we investigate the performance of this family of kernels in two-sample and independence tests: we show in particular that the energy distance most commonly employed in statistics is just one member of a parametric family of kernels, and that other choices from this family can yield more powerful tests.

1. Introduction

The problem of testing statistical hypotheses in high dimensional spaces is particularly challenging, and has been a recent focus of considerable work in the statistics and machine learning communities. On the statistical side, two-sample testing in Euclidean spaces (of whether two independent samples are from the same distribution, or from different distributions) can be accomplished using a so-called energy distance as a statistic (Székely & Rizzo, 2004; 2005). Such tests are consistent against all alternatives as long as the random variables have finite first moments. A related dependence measure between vectors of high dimension is the distance covariance (Székely et al., 2007; Székely & Rizzo, 2009), and the resulting test of independence is again consistent for variables with bounded first moment. The distance covariance has had a major impact in the statistics community, with Székely & Rizzo (2009) being accompanied by an editorial introduction and discussion. A particular advantage of distance-based statistics is their compact representation in terms of certain expectations of pairwise Euclidean distances, which leads to straightforward empirical estimates. As a follow-up work, Lyons (2011) generalized the notion of distance covariance to metric spaces of negative type (of which Euclidean spaces are a special case).

On the machine learning side, two-sample tests have been formulated based on embeddings of probability distributions into reproducing kernel Hilbert spaces (Gretton et al., 2012), using as the test statistic the difference between these embeddings: this statistic is called the maximum mean discrepancy (MMD). This distance measure was applied to the problem of testing for independence, with the associated test statistic being the Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2005a; 2008; Smola et al., 2007; Zhang et al., 2011). Both tests are shown to be consistent against all alternatives when a characteristic RKHS is used (Fukumizu et al., 2008; Sriperumbudur et al., 2010). Such tests can further be generalized to structured and non-Euclidean domains, such as text strings, graphs or groups (Fukumizu et al., 2009).

Despite their striking similarity, the link between energy distance-based tests and kernel-based tests has been an open question. In the discussion of Székely & Rizzo (2009), Gretton et al. (2009b, p. 1289) first explored this link in the context of independence testing, and stated that interpreting the distance-based independence statistic as a kernel statistic is not straightforward, since Bochner's theorem does not apply to the choice of weight function used in the definition of Brownian distance covariance (we briefly review this argument in Section A.3 of the Appendix). Székely & Rizzo (2009, Rejoinder, p. 1303) confirmed this conclusion, and commented that RKHS-based dependence measures do not seem to be formal extensions of Brownian distance covariance because the weight function is not integrable. Our contribution resolves this question and shows that RKHS-based dependence measures are precisely the formal extensions of Brownian distance covariance, where the problem of non-integrability of weight functions is circumvented by using translation-variant kernels, i.e., distance-induced kernels, a novel family of kernels that we introduce in Section 2.2.

In the case of two-sample testing, we demonstrate that energy distances are in fact maximum mean discrepancies arising from the same family of distance-induced kernels. A number of interesting consequences arise from this insight: first, we show that the energy distance (and distance covariance) derives from a particular parameter choice from a larger family of kernels: this choice may not yield the most sensitive test. Second, results from Gretton et al. (2009a); Zhang et al. (2011) may be applied to get consistent two-sample and independence tests for the energy distance, without using bootstrap, which perform much better than the upper bound proposed by Székely et al. (2007) as an alternative to the bootstrap. Third, in relation to Lyons (2011), we obtain a new family of characteristic kernels arising from semimetric spaces of negative type (where the triangle inequality need not hold), which are quite unlike the characteristic kernels defined via Bochner's theorem (Sriperumbudur et al., 2010).

The structure of the paper is as follows: In Section 2, we provide the necessary definitions from RKHS theory, and the relation between RKHS and semimetrics of negative type. In Section 3.1, we review both the energy distance and distance covariance. We relate these quantities in Sections 3.2 and 3.3 to the Maximum Mean Discrepancy (MMD) and the Hilbert-Schmidt Independence Criterion (HSIC), respectively. We give conditions for these quantities to distinguish between probability measures in Section 4, thus obtaining a new family of characteristic kernels. Empirical estimates of these quantities and associated two-sample and independence tests are described in Section 5. Finally, in Section 6, we investigate the performance of the test statistics on a variety of testing problems, which demonstrate the strengths of the new kernel family.

2. Definitions and Notation

In this section, we introduce concepts and notation required to understand reproducing kernel Hilbert spaces (Section 2.1), and distribution embeddings into RKHS. We then introduce semimetrics (Section 2.2), and review the relation of semimetrics of negative type to RKHS kernels.

2.1. RKHS Definitions

Unless stated otherwise, we will assume that $\mathcal{Z}$ is any topological space.

Definition 1. (RKHS) Let $\mathcal{H}$ be a Hilbert space of real-valued functions defined on $\mathcal{Z}$. A function $k : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ is called a reproducing kernel of $\mathcal{H}$ if (i) $\forall z \in \mathcal{Z}$, $k(\cdot, z) \in \mathcal{H}$, and (ii) $\forall z \in \mathcal{Z}$, $\forall f \in \mathcal{H}$, $\langle f, k(\cdot, z)\rangle_{\mathcal{H}} = f(z)$. If $\mathcal{H}$ has a reproducing kernel, it is called a reproducing kernel Hilbert space (RKHS).

According to the Moore-Aronszajn theorem (Berlinet & Thomas-Agnan, 2004, p. 19), for every symmetric, positive definite function $k : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$, there is an associated RKHS $\mathcal{H}_k$ of real-valued functions on $\mathcal{Z}$ with reproducing kernel $k$. The map $\varphi : \mathcal{Z} \to \mathcal{H}_k$, $\varphi : z \mapsto k(\cdot, z)$ is called the canonical feature map or the Aronszajn map of $k$. We will say that $k$ is a nondegenerate kernel if its Aronszajn map is injective.

2.2. Semimetrics of Negative Type

We will work with the notion of semimetric of negative type on a non-empty set $\mathcal{Z}$, where the "distance" function need not satisfy the triangle inequality. Note that this notion of semimetric is different to that which arises from a seminorm, where the distance between two distinct points can be zero (also called a pseudonorm).

Definition 2. (Semimetric) Let $\mathcal{Z}$ be a non-empty set and let $\rho : \mathcal{Z} \times \mathcal{Z} \to [0, \infty)$ be a function such that $\forall z, z' \in \mathcal{Z}$, (i) $\rho(z, z') = 0$ if and only if $z = z'$, and (ii) $\rho(z, z') = \rho(z', z)$. Then $(\mathcal{Z}, \rho)$ is said to be a semimetric space and $\rho$ is called a semimetric on $\mathcal{Z}$. If, in addition, (iii) $\forall z, z', z'' \in \mathcal{Z}$, $\rho(z', z'') \le \rho(z, z') + \rho(z, z'')$, then $(\mathcal{Z}, \rho)$ is said to be a metric space and $\rho$ is called a metric on $\mathcal{Z}$.

Definition 3. (Negative type) The semimetric space $(\mathcal{Z}, \rho)$ is said to have negative type if $\forall n \ge 2$, $z_1, \ldots, z_n \in \mathcal{Z}$, and $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ with $\sum_{i=1}^{n} \alpha_i = 0$,

\[ \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j \, \rho(z_i, z_j) \le 0. \tag{1} \]
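As a quick sanity check of Definition 3, the inequality (1) is easy to probe numerically. The following minimal Python sketch (assuming NumPy; an editorial illustration, not part of the original paper) draws random points and random weights summing to zero, and evaluates the left-hand side of (1) for the Euclidean distance, which is of negative type (see Proposition 5 below); every printed value should be non-positive.

    import numpy as np

    rng = np.random.default_rng(0)
    for _ in range(5):
        n, d = 10, 3
        z = rng.normal(size=(n, d))          # points z_1, ..., z_n in R^d
        alpha = rng.normal(size=n)
        alpha -= alpha.mean()                # enforce sum_i alpha_i = 0
        # rho(z_i, z_j) = ||z_i - z_j|| (Euclidean distance matrix)
        rho = np.linalg.norm(z[:, None] - z[None, :], axis=2)
        print(alpha @ rho @ alpha)           # inequality (1): should be <= 0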

Note that in the terminology of Berg et al. (1984), $\rho$ satisfying (1) is said to be a negative definite function. The following result is a direct consequence of Berg et al. (1984, Proposition 3.2, p. 82).

Proposition 4. $\rho$ is a semimetric of negative type if and only if there exists a Hilbert space $\mathcal{H}$ and an injective map $\varphi : \mathcal{Z} \to \mathcal{H}$, such that

\[ \rho(z, z') = \|\varphi(z) - \varphi(z')\|_{\mathcal{H}}^2. \tag{2} \]

This shows that $(\mathbb{R}^d, \|\cdot - \cdot\|^2)$ is of negative type. From Berg et al. (1984, Corollary 2.10, p. 78), we have that:

Proposition 5. If $\rho$ satisfies (1), then so does $\rho^q$, for $0 < q < 1$.

Therefore, by taking $q = 1/2$, we conclude that all Euclidean spaces are of negative type. While Lyons (2011, p. 9) also uses the result in Proposition 4, he studies embeddings to general Hilbert spaces, and the relation with the theory of reproducing kernel Hilbert spaces is not exploited. Semimetrics of negative type and symmetric positive definite kernels are in fact closely related, as summarized in the following Lemma, based on Berg et al. (1984, Lemma 2.1, p. 74).

Lemma 6. Let $\mathcal{Z}$ be a nonempty set, and let $\rho$ be a semimetric on $\mathcal{Z}$. Let $z_0 \in \mathcal{Z}$, and denote $k(z, z') = \rho(z, z_0) + \rho(z', z_0) - \rho(z, z')$. Then $k$ is positive definite if and only if $\rho$ satisfies (1).

We call the kernel $k$ defined above the distance-induced kernel, and say that it is induced by the semimetric $\rho$. For brevity, we will drop "induced" hereafter, and say that $k$ is simply the distance kernel (with some abuse of terminology). In addition, we will typically work with distance kernels scaled by $1/2$. Note that $k(z_0, z_0) = 0$, so distance kernels are not strictly positive definite (equivalently, $k(\cdot, z_0) = 0$). By varying "the point at the center" $z_0$, one obtains a family

\[ \mathcal{K}_\rho = \left\{ \tfrac{1}{2}\left[\rho(z, z_0) + \rho(z', z_0) - \rho(z, z')\right] \right\}_{z_0 \in \mathcal{Z}} \]

of distance kernels induced by $\rho$. We may now express (2) from Proposition 4 in terms of the canonical feature map for the RKHS $\mathcal{H}_k$ (proof in Appendix A.1).

Proposition 7. Let $(\mathcal{Z}, \rho)$ be a semimetric space of negative type, and $k \in \mathcal{K}_\rho$. Then:

1. $k$ is nondegenerate, i.e., the Aronszajn map $z \mapsto k(\cdot, z)$ is injective.
2. $\rho(z, z') = k(z, z) + k(z', z') - 2k(z, z') = \|k(\cdot, z) - k(\cdot, z')\|_{\mathcal{H}_k}^2$.

Note that Proposition 7 implies that the Aronszajn map $z \mapsto k(\cdot, z)$ is an isometric embedding of a metric space $(\mathcal{Z}, \rho^{1/2})$ into $\mathcal{H}_k$, for every $k \in \mathcal{K}_\rho$.

2.3. Kernels Inducing Semimetrics

We now further develop the link between semimetrics of negative type and kernels. Let $k$ be any nondegenerate reproducing kernel on $\mathcal{Z}$ (for example, every strictly positive definite $k$ is nondegenerate). Then, by Proposition 4,

\[ \rho(z, z') = k(z, z) + k(z', z') - 2k(z, z') \tag{3} \]

defines a valid semimetric $\rho$ of negative type on $\mathcal{Z}$. We will say that $k$ generates $\rho$. It is clear that every distance kernel $\tilde{k} \in \mathcal{K}_\rho$ also generates $\rho$, and that $\tilde{k}$ can be expressed as

\[ \tilde{k}(z, z') = k(z, z') + k(z_0, z_0) - k(z, z_0) - k(z', z_0), \tag{4} \]

for some $z_0 \in \mathcal{Z}$. In addition, $k \in \mathcal{K}_\rho$ if and only if $k(z_0, z_0) = 0$ for some $z_0 \in \mathcal{Z}$. Hence, it is clear that any strictly positive definite kernel, e.g., the Gaussian kernel $e^{-\sigma\|z - z'\|^2}$, is not a distance kernel.

Example 8. Let $\mathcal{Z} = \mathbb{R}^d$ and write $\rho_q(z, z') = \|z - z'\|^q$. By combining Propositions 4 and 5, $\rho_q$ is a valid semimetric of negative type for $0 < q \le 2$. It is a metric of negative type if $q \le 1$. The corresponding distance kernel "centered at zero" is given by

\[ k_q(z, z') = \tfrac{1}{2}\left( \|z\|^q + \|z'\|^q - \|z - z'\|^q \right). \tag{5} \]

Example 9. Let $\mathcal{Z} = \mathbb{R}^d$, and consider the Gaussian kernel $k(z, z') = e^{-\sigma\|z - z'\|^2}$. The induced semimetric is $\rho(z, z') = 2\left[1 - e^{-\sigma\|z - z'\|^2}\right]$. There are many other kernels that generate $\rho$, however; for example, the distance kernel induced by $\rho$ and "centered at zero" is $\tilde{k}(z, z') = e^{-\sigma\|z - z'\|^2} + 1 - e^{-\sigma\|z\|^2} - e^{-\sigma\|z'\|^2}$.
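The kernels of Example 8 are computed directly from pairwise distances. Below is a minimal Python sketch (assuming NumPy; our illustration, not code from the paper) of the distance kernel family $\mathcal{K}_{\rho_q}$, with the center $z_0$ exposed as a parameter; the default $z_0 = 0$ recovers $k_q$ of (5).

    import numpy as np

    def pairwise_dist(Z, W, q=1.0):
        """rho_q(z, w) = ||z - w||^q for rows of Z (m x d) and W (n x d)."""
        return np.linalg.norm(Z[:, None, :] - W[None, :, :], axis=2) ** q

    def distance_kernel(Z, W, q=1.0, z0=None):
        """Distance kernel 0.5 * [rho(z, z0) + rho(w, z0) - rho(z, w)];
        z0 = origin recovers k_q of (5)."""
        if z0 is None:
            z0 = np.zeros(Z.shape[1])
        z0 = z0[None, :]
        return 0.5 * (pairwise_dist(Z, z0, q)
                      + pairwise_dist(W, z0, q).T
                      - pairwise_dist(Z, W, q))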
3. Distances and Covariances

In this section, we begin with a description of the energy distance, which measures distance between distributions, and distance covariance, which measures dependence. We then demonstrate that the former is a special instance of the maximum mean discrepancy (a kernel measure of distance on distributions), and the latter an instance of the Hilbert-Schmidt Independence Criterion (a kernel dependence measure). We will denote by $M(\mathcal{Z})$ the set of all finite signed Borel measures on $\mathcal{Z}$, and by $M_+^1(\mathcal{Z})$ the set of all Borel probability measures on $\mathcal{Z}$.

3.1. Energy Distance and Distance Covariance

Székely & Rizzo (2004; 2005) use the following measure of statistical distance between two probability measures $P$ and $Q$ on $\mathbb{R}^d$, termed the energy distance:

\[ D_E(P, Q) = 2\,\mathbb{E}_{ZW}\|Z - W\| - \mathbb{E}_{ZZ'}\|Z - Z'\| - \mathbb{E}_{WW'}\|W - W'\|, \tag{6} \]

where $Z, Z' \overset{i.i.d.}{\sim} P$ and $W, W' \overset{i.i.d.}{\sim} Q$. This quantity characterizes the equality of distributions, and in the scalar case, it coincides with twice the Cramér-von Mises distance. We may generalize it to a semimetric space of negative type $(\mathcal{Z}, \rho)$, with the expression for this generalized energy distance $D_{E,\rho}(P, Q)$ being of the same form as (6), with the Euclidean distance replaced by $\rho$. Note that the negative type of $\rho$ implies the non-negativity of $D_{E,\rho}$. In Section 3.2, we will show that for every $\rho$, $D_{E,\rho}$ is precisely the MMD associated to a particular kernel $k$ on $\mathcal{Z}$.

Now, let $X$ be a random vector on $\mathbb{R}^p$ and $Y$ a random vector on $\mathbb{R}^q$. The distance covariance was introduced in Székely et al. (2007); Székely & Rizzo (2009) to address the problem of testing and measuring dependence between $X$ and $Y$, in terms of a weighted $L_2$-distance between characteristic functions of the joint distribution of $X$ and $Y$ and the product of their marginals. Given a particular choice of weight function, it can be computed in terms of certain expectations of pairwise Euclidean distances:

\[
\begin{aligned}
\mathcal{V}^2(X, Y) &= \mathbb{E}_{XY}\mathbb{E}_{X'Y'}\|X - X'\|\,\|Y - Y'\| + \mathbb{E}_X\mathbb{E}_{X'}\|X - X'\|\;\mathbb{E}_Y\mathbb{E}_{Y'}\|Y - Y'\| \\
&\quad - 2\,\mathbb{E}_{XY}\!\left[\mathbb{E}_{X'}\|X - X'\|\;\mathbb{E}_{Y'}\|Y - Y'\|\right],
\end{aligned} \tag{7}
\]

where $(X, Y)$ and $(X', Y')$ are $\overset{i.i.d.}{\sim} P_{XY}$. Recently, Lyons (2011) established that the generalization of the distance covariance is possible to metric spaces of negative type, with the expression for this generalized distance covariance $\mathcal{V}^2_{\rho_X, \rho_Y}(X, Y)$ being of the same form as (7), with Euclidean distances replaced by metrics of negative type $\rho_X$ and $\rho_Y$ on domains $\mathcal{X}$ and $\mathcal{Y}$, respectively. In Section 3.3, we will show that the generalized distance covariance of a pair of random variables $X$ and $Y$ is precisely HSIC associated to a particular kernel $k$ on the product of domains of $X$ and $Y$.

3.2. Maximum Mean Discrepancy

The notion of the feature map in an RKHS (Section 2.1) can be extended to kernel embeddings of probability measures (Berlinet & Thomas-Agnan, 2004; Sriperumbudur et al., 2010).

Definition 10. (Kernel embedding) Let $k$ be a kernel on $\mathcal{Z}$, and $P \in M_+^1(\mathcal{Z})$. The kernel embedding of $P$ into the RKHS $\mathcal{H}_k$ is $\mu_k(P) \in \mathcal{H}_k$ such that $\mathbb{E}_{Z \sim P} f(Z) = \langle f, \mu_k(P)\rangle_{\mathcal{H}_k}$ for all $f \in \mathcal{H}_k$.

Alternatively, the kernel embedding can be defined by the Bochner expectation $\mu_k(P) = \mathbb{E}_{Z \sim P}\, k(\cdot, Z)$. By the Riesz representation theorem, a sufficient condition for the existence of $\mu_k(P)$ is that $k$ is Borel-measurable and that $\mathbb{E}_{Z \sim P}\, k^{1/2}(Z, Z) < \infty$. If $k$ is a bounded continuous function, this is obviously true for all $P \in M_+^1(\mathcal{Z})$. Kernel embeddings can be used to induce metrics on the spaces of probability measures, giving the maximum mean discrepancy (MMD),

\[ \gamma_k^2(P, Q) = \|\mu_k(P) - \mu_k(Q)\|_{\mathcal{H}_k}^2 = \mathbb{E}_{ZZ'} k(Z, Z') + \mathbb{E}_{WW'} k(W, W') - 2\,\mathbb{E}_{ZW} k(Z, W), \tag{8} \]

where $Z, Z' \overset{i.i.d.}{\sim} P$ and $W, W' \overset{i.i.d.}{\sim} Q$. If the restriction of $\mu_k$ to some $\mathcal{P}(\mathcal{Z}) \subseteq M_+^1(\mathcal{Z})$ is well defined and injective, then $k$ is said to be characteristic to $\mathcal{P}(\mathcal{Z})$, and it is said to be characteristic (without further qualification) if it is characteristic to $M_+^1(\mathcal{Z})$. When $k$ is characteristic, $\gamma_k$ is a metric on $M_+^1(\mathcal{Z})$, i.e., $\gamma_k(P, Q) = 0$ iff $P = Q$, $\forall P, Q \in M_+^1(\mathcal{Z})$. Conditions under which kernels are characteristic have been studied by Sriperumbudur et al. (2008); Fukumizu et al. (2009); Sriperumbudur et al. (2010). An alternative interpretation of (8) is as an integral probability metric (Müller, 1997): see Gretton et al. (2012) for details.

In general, distance kernels are continuous but unbounded functions.
Thus, kernel embeddings are not defined for all Borel probability measures, and one needs to restrict the attention to a class of Borel probability measures for which $\mathbb{E}_{Z \sim P}\, k^{1/2}(Z, Z) < \infty$ when discussing the maximum mean discrepancy. We will assume that all Borel probability measures considered satisfy the stronger condition $\mathbb{E}_{Z \sim P}\, k(Z, Z) < \infty$ (this reflects a finite first moment condition on random variables considered in distance covariance tests, and will imply that all quantities appearing in our results are well defined). For more details, see Section A.4 in the Appendix. As an alternative to requiring this condition, one may assume that the underlying semimetric space $(\mathcal{Z}, \rho)$ of negative type is itself bounded, i.e., that $\sup_{z, z' \in \mathcal{Z}} \rho(z, z') < \infty$.

We are now able to describe the relation between the maximum mean discrepancy and the energy distance. The following theorem is a consequence of Lemma 6, and is proved in Section A.1 of the Appendix.

Theorem 11. Let $(\mathcal{Z}, \rho)$ be a semimetric space of negative type and let $z_0 \in \mathcal{Z}$. The distance kernel $k$ induced by $\rho$ satisfies $\gamma_k^2(P, Q) = \frac{1}{2} D_{E,\rho}(P, Q)$. In particular, $\gamma_k^2$ does not depend on the choice of $z_0$.

There is a subtlety to the link between kernels and semimetrics, when used in computing the distance on probabilities. Consider again the family of distance kernels $\mathcal{K}_\rho$, where the semimetric $\rho$ is itself generated from $k$ according to (3). As we have seen, it may be that $k \notin \mathcal{K}_\rho$; however, it is clear that $\gamma_k^2(P, Q) = \frac{1}{2} D_{E,\rho}(P, Q)$ whenever $k$ generates $\rho$. Thus, all kernels that generate the same semimetric $\rho$ on $\mathcal{Z}$ give rise to the same metric $\gamma_k$ on (possibly a subset of) $M_+^1(\mathcal{Z})$, and $\gamma_k$ is merely an extension of the metric $\rho^{1/2}$ on the point masses. The kernel-based and distance-based methods are therefore equivalent, provided that we allow "distances" $\rho$ which may not satisfy the triangle inequality.
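Theorem 11 can be checked numerically with plug-in estimates: with V-statistics, the empirical energy distance computed from pairwise $\rho$-distances equals twice the empirical $\gamma_k^2$ computed from any distance kernel in $\mathcal{K}_\rho$, regardless of the center $z_0$. A small sketch of this check (our illustration, reusing the pairwise_dist and distance_kernel helpers from the previous sketch):

    import numpy as np

    rng = np.random.default_rng(1)
    Z = rng.normal(size=(100, 2))
    W = rng.normal(loc=0.5, size=(120, 2))

    d_zw = pairwise_dist(Z, W)
    d_zz = pairwise_dist(Z, Z)
    d_ww = pairwise_dist(W, W)
    energy = 2 * d_zw.mean() - d_zz.mean() - d_ww.mean()   # plug-in estimate of (6)

    for z0 in [np.zeros(2), rng.normal(size=2)]:           # gamma_k^2 must not depend on z0
        k = lambda A, B: distance_kernel(A, B, q=1.0, z0=z0)
        mmd2 = k(Z, Z).mean() + k(W, W).mean() - 2 * k(Z, W).mean()  # estimate of (8)
        assert np.isclose(2 * mmd2, energy)                # Theorem 11: gamma_k^2 = D_E / 2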
3.3. The Hilbert-Schmidt Independence Criterion

Given a pair of jointly observed random variables $(X, Y)$ with values in $\mathcal{X} \times \mathcal{Y}$, the Hilbert-Schmidt Independence Criterion (HSIC) is computed as the maximum mean discrepancy between the joint distribution $P_{XY}$ and the product of its marginals $P_X P_Y$. Let $k_X$ and $k_Y$ be kernels on $\mathcal{X}$ and $\mathcal{Y}$, with respective RKHSs $\mathcal{H}_{k_X}$ and $\mathcal{H}_{k_Y}$. Following Smola et al. (2007, Section 2.3), we consider the MMD associated to the kernel $k((x, y), (x', y')) = k_X(x, x')\, k_Y(y, y')$ on $\mathcal{X} \times \mathcal{Y}$, with RKHS $\mathcal{H}_k$ isometrically isomorphic to the tensor product $\mathcal{H}_{k_X} \otimes \mathcal{H}_{k_Y}$. It follows that $\theta := \gamma_k^2(P_{XY}, P_X P_Y)$ satisfies

\[
\begin{aligned}
\theta &= \left\| \mathbb{E}_{XY}\left[k_X(\cdot, X) \otimes k_Y(\cdot, Y)\right] - \mathbb{E}_X k_X(\cdot, X) \otimes \mathbb{E}_Y k_Y(\cdot, Y) \right\|_{\mathcal{H}_{k_X} \otimes \mathcal{H}_{k_Y}}^2 \\
&= \mathbb{E}_{XY}\mathbb{E}_{X'Y'}\, k_X(X, X')\, k_Y(Y, Y') + \mathbb{E}_X\mathbb{E}_{X'}\, k_X(X, X')\; \mathbb{E}_Y\mathbb{E}_{Y'}\, k_Y(Y, Y') \\
&\quad - 2\,\mathbb{E}_{X'Y'}\!\left[ \mathbb{E}_X k_X(X, X')\; \mathbb{E}_Y k_Y(Y, Y') \right],
\end{aligned}
\]

where in the last step we used that $\langle f \otimes g, f' \otimes g' \rangle_{\mathcal{H}_{k_X} \otimes \mathcal{H}_{k_Y}} = \langle f, f' \rangle_{\mathcal{H}_{k_X}} \langle g, g' \rangle_{\mathcal{H}_{k_Y}}$. It can be shown that this quantity is the squared Hilbert-Schmidt norm of the covariance operator between RKHSs (Gretton et al., 2005b). The following theorem demonstrates the link between HSIC and the distance covariance, and is proved in Appendix A.1.

Theorem 12. Let $(\mathcal{X}, \rho_X)$ and $(\mathcal{Y}, \rho_Y)$ be semimetric spaces of negative type, and $(x_0, y_0) \in \mathcal{X} \times \mathcal{Y}$. Define

\[
k\left((x, y), (x', y')\right) := \left[\rho_X(x, x_0) + \rho_X(x', x_0) - \rho_X(x, x')\right] \times \left[\rho_Y(y, y_0) + \rho_Y(y', y_0) - \rho_Y(y, y')\right]. \tag{9}
\]

Then $k$ is a positive definite kernel on $\mathcal{X} \times \mathcal{Y}$, and $\gamma_k^2(P_{XY}, P_X P_Y) = \mathcal{V}^2_{\rho_X, \rho_Y}(X, Y)$.

We remark that a similar result to Theorem 12 is given by Lyons (2011, Proposition 3.16), but without making use of the RKHS equivalence. Theorem 12 is a more general statement, in the sense that we allow $\rho$ to be a semimetric of negative type, rather than a metric. In addition to yielding a more general statement, the RKHS equivalence leads to a significantly simpler proof: the result is an immediate application of the HSIC expansion of Smola et al. (2007).

4. Distinguishing Probability Distributions

Lyons (2011, Theorem 3.20) shows that distance covariance in a metric space characterizes independence if the metrics satisfy an additional property, termed strong negative type. We will extend this notion to a semimetric $\rho$. We will say that $P \in M_+^1(\mathcal{Z})$ has a finite first moment w.r.t. $\rho$ if $\int \rho(z, z_0)\, dP(z)$ is finite for some $z_0 \in \mathcal{Z}$. It is easy to see that the integral $\int \rho\, d\left([P - Q] \times [P - Q]\right) = -D_{E,\rho}(P, Q)$ converges whenever $P$ and $Q$ have finite first moments w.r.t. $\rho$. In Appendix A.4, we show that this condition is equivalent to $\mathbb{E}_{Z \sim P}\, k(Z, Z) < \infty$, for a kernel $k$ that generates $\rho$, which implies that the kernel embedding $\mu_k(P)$ is also well defined.

Definition 13. The semimetric space $(\mathcal{Z}, \rho)$ is said to have a strong negative type if $\forall P, Q \in M_+^1(\mathcal{Z})$ with finite first moment w.r.t. $\rho$,

\[ P \ne Q \;\Rightarrow\; \int \rho\, d\left([P - Q] \times [P - Q]\right) < 0. \tag{10} \]

The quantity in (10) is exactly $-2\gamma_k^2(P, Q)$ for all $P, Q$ with finite first moment w.r.t. $\rho$. We directly obtain:

Proposition 14. Let kernel $k$ generate $\rho$. Then $(\mathcal{Z}, \rho)$ has a strong negative type if and only if $k$ is characteristic to all probability measures with finite first moment w.r.t. $\rho$.

Thus, the problems of checking whether a semimetric is of strong negative type and whether its associated kernel is characteristic to an appropriate space of Borel probability measures are equivalent. This conclusion has some overlap with Lyons (2011): in particular, Proposition 14 is stated in Lyons (2011, Proposition 3.10), where the barycenter map $\beta$ is a kernel embedding in our terminology, although Lyons does not consider distribution embeddings in an RKHS.

5. Empirical Estimates and Hypothesis Tests

In the case of two-sample testing, we are given i.i.d. samples $\mathbf{z} = \{z_i\}_{i=1}^m \sim P$ and $\mathbf{w} = \{w_i\}_{i=1}^n \sim Q$. The empirical (biased) V-statistic estimate of (8) is

\[ \hat\gamma_{k,V}^2(\mathbf{z}, \mathbf{w}) = \frac{1}{m^2}\sum_{i=1}^{m}\sum_{j=1}^{m} k(z_i, z_j) + \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} k(w_i, w_j) - \frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} k(z_i, w_j). \tag{11} \]

Recall that if we use a distance kernel $k$ induced by a semimetric $\rho$, this estimate involves only the pairwise $\rho$-distances between the sample points.

In the case of independence testing, we are given i.i.d. samples $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m \sim P_{XY}$, and the resulting V-statistic estimate (HSIC) is (Gretton et al., 2005a; 2008)

\[ \mathrm{HSIC}(\mathbf{z}; k_X, k_Y) = \frac{1}{m^2}\,\mathrm{Tr}(K_X H K_Y H), \tag{12} \]

where $K_X$, $K_Y$ and $H$ are $m \times m$ matrices given by $(K_X)_{ij} := k_X(x_i, x_j)$, $(K_Y)_{ij} := k_Y(y_i, y_j)$ and $H_{ij} = \delta_{ij} - \frac{1}{m}$ (the centering matrix). As in the two-sample case, if both $k_X$ and $k_Y$ are distance kernels, the test statistic involves only the pairwise distances between the samples, i.e., kernel matrices in (12) may be replaced by distance matrices.
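Both statistics reduce to a few matrix operations. The following minimal Python sketch (our illustration, assuming NumPy; kernel_fn stands for any kernel function of two sample arrays, e.g., the distance_kernel helper above) implements the biased V-statistics (11) and (12).

    import numpy as np

    def mmd2_v(Z, W, kernel_fn):
        """Biased V-statistic (11) for the squared MMD between samples Z and W."""
        return (kernel_fn(Z, Z).mean() + kernel_fn(W, W).mean()
                - 2 * kernel_fn(Z, W).mean())

    def hsic_v(X, Y, kx_fn, ky_fn):
        """Biased V-statistic (12): HSIC = Tr(K_X H K_Y H) / m^2."""
        m = X.shape[0]
        H = np.eye(m) - np.ones((m, m)) / m    # centering matrix H_ij = delta_ij - 1/m
        KX, KY = kx_fn(X, X), ky_fn(Y, Y)
        return np.trace(KX @ H @ KY @ H) / m ** 2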
We would like to design distance-based tests with an asymptotic Type I error of $\alpha$, and thus we require an estimate of the $(1 - \alpha)$-quantile of the V-statistic distribution under the null hypothesis. Under the null hypothesis, both (11) and (12) converge to a particular weighted sum of chi-squared distributed independent random variables (for more details, see Section A.2). We investigate two approaches, both of which yield consistent tests: a bootstrap approach (Arcones & Giné, 1992), and a spectral approach (Gretton et al., 2009a; Zhang et al., 2011). The latter requires empirical computation of the spectrum of kernel integral operators, a problem studied extensively in the context of kernel PCA (Schölkopf et al., 1997). In the two-sample case, one computes the eigenvalues of the centred Gram matrix $\tilde{K} = HKH$ on the aggregated samples. Here, $K$ is a $2m \times 2m$ matrix with entries $K_{ij} = k(u_i, u_j)$, where $\mathbf{u} = [\mathbf{z}\ \mathbf{w}]$ is the concatenation of the two samples and $H$ is the centering matrix. Gretton et al. (2009a) show that the null distribution defined using these finite sample estimates converges to the population distribution, provided that the spectrum is square-root summable. The same approach can be used for a consistent finite sample null distribution of HSIC, via computation of the eigenvalues of $\tilde{K}_X = HK_XH$ and $\tilde{K}_Y = HK_YH$ (Zhang et al., 2011).
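A minimal sketch of the spectral approach for the two-sample case (our illustration only, not the authors' implementation; it follows the recipe above and the weighted chi-square limit reconstructed as (14) in Appendix A.2, whose exact scaling convention is our assumption):

    import numpy as np

    def spectral_null_quantile(K, alpha=0.05, n_sim=10000, rng=None):
        """Approximate (1 - alpha)-quantile of the null distribution of the
        two-sample V-statistic, from the Gram matrix K (2m x 2m) of the
        aggregated sample. Compare (m/2) * mmd2_v(Z, W, kernel_fn) against
        the returned threshold, per the convention of (14)."""
        rng = rng if rng is not None else np.random.default_rng()
        n = K.shape[0]                           # n = 2m aggregated points
        H = np.eye(n) - np.ones((n, n)) / n
        lam = np.linalg.eigvalsh(H @ K @ H) / n  # empirical operator eigenvalues
        lam = lam[lam > 1e-12]                   # drop numerically zero/negative values
        sims = rng.chisquare(1, size=(n_sim, lam.size)) @ lam  # sum_i lambda_i N_i^2
        return np.quantile(sims, 1 - alpha)

The bootstrap alternative simply recomputes the statistic over random permutations of the aggregated sample.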
Both Székely & Rizzo (2004, p. 14) and Székely et al. (2007, pp. 2782–2783) establish that the energy distance and distance covariance statistics, respectively, converge to a particular weighted sum of chi-squares of a form similar to that found for the kernel-based statistics. Analogous results for the generalized distance covariance are presented by Lyons (2011, pp. 7–8). These works do not propose test designs that attempt to estimate the coefficients in such representations of the null distribution, however (note also that these coefficients have a more intuitive interpretation using kernels). Besides the bootstrap, Székely et al. (2007, Theorem 6) also propose an independence test using a bound applicable to a general quadratic form $Q$ of centered Gaussian random variables with $\mathbb{E}[Q] = 1$: $P\left\{Q \ge \left(\Phi^{-1}(1 - \alpha/2)\right)^2\right\} \le \alpha$, valid for $0 < \alpha \le 0.215$. When applied to the distance covariance statistic, the upper bound of $\alpha$ is achieved if $X$ and $Y$ are independent Bernoulli variables. The authors remark that the resulting criterion might be over-conservative. Thus, more sensitive tests are possible by computing the spectrum of the centred Gram matrices associated to distance kernels, and we pursue this approach in the next section.

6. Experiments

6.1. Two-sample Experiments

In the two-sample experiments, we investigate three different kinds of synthetic data. In the first, we compare two multivariate Gaussians, where the means differ in one dimension only, and all variances are equal. In the second, we again compare two multivariate Gaussians, but this time with identical means in all dimensions, and variance that differs in a single dimension. In our third experiment, we use the benchmark data of Sriperumbudur et al. (2009): one distribution is a univariate Gaussian, and the second is a univariate Gaussian with a sinusoidal perturbation of increasing frequency (where higher frequencies correspond to harder problems). All tests use a distance kernel induced by the Euclidean distance. As shown in the left plots of Figure 1, the spectral and bootstrap test designs appear indistinguishable, and they significantly outperform the test designed using the quadratic form bound, which appears to be far too conservative for the data sets considered. This is confirmed by checking the Type I error of the quadratic form test, which is significantly smaller than the test size of $\alpha = 0.05$. We also compare the performance to that of the Gaussian kernel, with the bandwidth set to the median distance between points in the aggregation of samples. We see that when the means differ, both tests perform similarly. When the variances differ, it is clear that the Gaussian kernel has a major advantage over the distance kernel, although this advantage decreases with increasing dimension (where both perform poorly). In the case of a sinusoidal perturbation, the performance is again very similar.

[Figure 1. (left) MMD using Gaussian and distance kernels for various tests; (right) Spectral MMD using distance kernels with various exponents. The number of samples in all experiments was set to m = 200. Panels: "Means differ", "Variances differ", and "Sinusoidal perturbation", all at α = 0.05; each panel plots Type II error against log₂(dimension) or sinusoid frequency, with legends covering Spectral/Bootstrap/Qform designs for distance and Gaussian kernels, distance kernels with exponents q ∈ {1/6, 1/3, 2/3, 1, 4/3, 5/3, 2}, and the Gaussian kernel with median bandwidth.]

[Figure 2. HSIC using distance kernels with various exponents and a Gaussian kernel, as a function of (left) the angle of rotation for the dependence induced by rotation (m = 1024, d = 4, α = 0.05; Type II error vs. angle of rotation ×π/4); (right) frequency ℓ in the sinusoidal dependence example (m = 512, α = 0.05; Type II error vs. frequency).]

In addition, following Example 8, we investigate the performance of kernels obtained using the semimetric $\rho(z, z') = \|z - z'\|^q$ for $0 < q \le 2$. Results are presented in the right hand plots of Figure 1. While judiciously chosen values of $q$ offer some improvement in the cases of differing mean and variance, we see a dramatic improvement for the sinusoidal perturbation, compared with the case $q = 1$ and the Gaussian kernel: values $q = 1/3$ (and smaller) yield virtually error-free performance even at high frequencies (note that $q = 1$ corresponds to the energy distance described in Székely & Rizzo (2004; 2005)). Additional experiments with real-world data are presented in Appendix A.6.

We observe from the simulation results that distance kernels with higher exponents are advantageous in cases where distributions differ in mean value along a single dimension (with noise in the remainder), whereas distance kernels with smaller exponents are more sensitive to differences in distributions at finer lengthscales (i.e., where the characteristic functions of the distributions differ at higher frequencies). This observation also appears to hold true on the real-world data experiments in Appendix A.6.

6.2. Independence Experiments

To assess independence tests, we used an artificial benchmark proposed by Gretton et al. (2008): we generate univariate random variables from the ICA benchmark densities of Bach & Jordan (2002); rotate them in the product space by an angle between 0 and $\pi/4$ to introduce dependence; fill additional dimensions with independent Gaussian noise; and, finally, pass the resulting multivariate data through random and independent orthogonal transformations. The resulting random variables $X$ and $Y$ are dependent but uncorrelated. The case m = 1024 (sample size) and d = 4 (dimension) is plotted in Figure 2 (left). As observed by Gretton et al. (2009b), the Gaussian kernel does better than the distance kernel with $q = 1$. By varying $q$, however, we are able to obtain a wide range of performance; in particular, the values $q = 1/6$ (and smaller) have an advantage over the Gaussian kernel on this dataset, especially in the case of smaller angles of rotation. As for the two-sample case, bootstrap and spectral tests have indistinguishable performance, and are significantly more sensitive than the quadratic form based test, which failed to reject the null hypothesis of independence on this dataset.

In addition, we assess the test performance on sinusoidally dependent data. The distribution over the pair $(X, Y)$ was drawn from $P_{XY} \propto 1 + \sin(\ell x)\sin(\ell y)$ for integer $\ell$, on the support $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} := [-\pi, \pi]$ and $\mathcal{Y} := [-\pi, \pi]$. In this way, increasing $\ell$ caused the departure from a uniform (independent) distribution to occur at increasing frequencies, making this departure harder to detect from a small sample size. Results are in Figure 2 (right). We note that the distance covariance outperforms the Gaussian kernel on this example, and that smaller exponents result in better performance (lower Type II error when the departure from independence occurs at higher frequencies).
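For reference, this sinusoidal benchmark is easy to reproduce. A minimal rejection sampler (our sketch, assuming NumPy; ℓ and m are free parameters, and the envelope bound of 2 follows from $1 + \sin(\ell x)\sin(\ell y) \le 2$):

    import numpy as np

    def sample_sinusoidal(m, ell, rng=None):
        """Draw m pairs (x, y) on [-pi, pi]^2 from the density proportional
        to 1 + sin(ell * x) * sin(ell * y), by rejection from the uniform."""
        rng = rng if rng is not None else np.random.default_rng()
        out = []
        while len(out) < m:
            x, y = rng.uniform(-np.pi, np.pi, size=2)
            if rng.uniform(0, 2) < 1 + np.sin(ell * x) * np.sin(ell * y):
                out.append((x, y))
        return np.array(out)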
Finally, we note that the setting $q = 1$, which is described in Székely et al. (2007); Székely & Rizzo (2009), is a reasonable heuristic in practice, but does not yield the most powerful tests on either dataset.

7. Conclusion

We have established an equivalence between the energy distance and distance covariance, and RKHS measures of distance between distributions. In particular, energy distances and RKHS distance measures coincide when the kernel is induced by a semimetric of negative type. The associated family of kernels performs well in two-sample and independence testing: interestingly, the parameter choice most commonly used in the statistics literature does not yield the most powerful tests in many settings.

The interpretation of the energy distance and distance covariance in an RKHS setting should be of considerable interest both to statisticians and machine learning researchers, since the associated kernels may be used much more widely: in conditional dependence testing and estimates of the chi-squared distance (Fukumizu et al., 2008), in Bayesian inference (Fukumizu et al., 2011), in mixture density estimation (Sriperumbudur, 2011) and in other machine learning applications. In particular, the link with kernels makes these applications of the energy distance immediate and straightforward. Finally, for problem settings defined most naturally in terms of distances, and where these distances are of negative type, there is an interpretation in terms of reproducing kernels, and the learning machinery from the kernel literature can be brought to bear.

References

Arcones, M. and Giné, E. On the bootstrap of U and V statistics. Ann. Stat., 20(2):655–674, 1992.

Bach, F. R. and Jordan, M. I. Kernel independent component analysis. J. Mach. Learn. Res., 3:1–48, 2002.

Berg, C., Christensen, J. P. R., and Ressel, P. Harmonic Analysis on Semigroups. Springer, New York, 1984.

Berlinet, A. and Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer, 2004.

Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. Kernel measures of conditional dependence. In NIPS, 2008.

Fukumizu, K., Sriperumbudur, B., Gretton, A., and Schölkopf, B. Characteristic kernels on groups and semigroups. In NIPS, 2009.

Fukumizu, K., Song, L., and Gretton, A. Kernel Bayes' rule. In NIPS, 2011.

Gretton, A., Bousquet, O., Smola, A.J., and Schölkopf, B. Measuring statistical dependence with Hilbert-Schmidt norms. In ALT, 2005a.

Gretton, A., Herbrich, R., Smola, A., Bousquet, O., and Schölkopf, B. Kernel methods for measuring independence. J. Mach. Learn. Res., 6:2075–2129, 2005b.

Gretton, A., Fukumizu, K., Teo, C.H., Song, L., Schölkopf, B., and Smola, A. A kernel statistical test of independence. In NIPS, 2008.

Gretton, A., Fukumizu, K., Harchaoui, Z., and Sriperumbudur, B. A fast, consistent kernel two-sample test. In NIPS, 2009a.

Gretton, A., Fukumizu, K., and Sriperumbudur, B. Discussion of: Brownian distance covariance. Ann. Appl. Stat., 3(4):1285–1294, 2009b.

Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., and Smola, A. A kernel two-sample test. J. Mach. Learn. Res., 13:723–773, 2012.

Lyons, R. Distance covariance in metric spaces. arXiv:1106.5758, June 2011.

Müller, A. Integral probability metrics and their generating classes of functions. Adv. Appl. Probab., 29(2):429–443, 1997.

Schölkopf, B., Smola, A. J., and Müller, K.-R. Kernel principal component analysis. In ICANN, 1997.

Smola, A. J., Gretton, A., Song, L., and Schölkopf, B. A Hilbert space embedding for distributions. In ALT, 2007.

Sriperumbudur, B. Mixture density estimation via Hilbert space embedding of measures. In IEEE ISIT, 2011.

Sriperumbudur, B., Gretton, A., Fukumizu, K., Lanckriet, G., and Schölkopf, B. Injective Hilbert space embeddings of probability measures. In COLT, 2008.

Sriperumbudur, B., Fukumizu, K., Gretton, A., Lanckriet, G., and Schölkopf, B. Kernel choice and classifiability for RKHS embeddings of probability distributions. In NIPS, 2009.

Sriperumbudur, B., Gretton, A., Fukumizu, K., Lanckriet, G., and Schölkopf, B. Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res., 11:1517–1561, 2010.

Steinwart, I. and Christmann, A. Support Vector Machines. Springer, 2008.

Székely, G. and Rizzo, M. Testing for equal distributions in high dimension. InterStat, (5), November 2004.

Székely, G. and Rizzo, M. A new test for multivariate normality. J. Multivariate Anal., 93:58–80, 2005.

Székely, G. and Rizzo, M. Brownian distance covariance. Ann. Appl. Stat., 3(4):1233–1303, 2009.

Székely, G., Rizzo, M., and Bakirov, N.K. Measuring and testing dependence by correlation of distances. Ann. Stat., 35(6):2769–2794, 2007.

Zhang, K., Peters, J., Janzing, D., and Schölkopf, B. Kernel-based conditional independence test and application in causal discovery. In UAI, 2011.

A. Appendix

A.1. Proofs

Proof. (Proposition 7) If $z, z' \in \mathcal{Z}$ are such that $k(w, z) = k(w, z')$ for all $w \in \mathcal{Z}$, one would also have $\rho(z, z_0) - \rho(z, w) = \rho(z', z_0) - \rho(z', w)$ for all $w \in \mathcal{Z}$. In particular, by inserting $w = z$ and $w = z'$, we obtain $\rho(z, z') = -\rho(z, z') = 0$, i.e., $z = z'$. The second statement follows readily by expressing $k$ in terms of $\rho$.

Proof. (Theorem 11) The result follows directly by inserting the distance kernel from Lemma 6 into (8) and cancelling out the terms dependent on a single random variable. Define $\theta := \gamma_k^2(P, Q)$. Then

\[
\begin{aligned}
\theta &= \tfrac{1}{2}\mathbb{E}_{ZZ'}\left[\rho(Z, z_0) + \rho(Z', z_0) - \rho(Z, Z')\right] + \tfrac{1}{2}\mathbb{E}_{WW'}\left[\rho(W, z_0) + \rho(W', z_0) - \rho(W, W')\right] \\
&\quad - \mathbb{E}_{ZW}\left[\rho(Z, z_0) + \rho(W, z_0) - \rho(Z, W)\right] \\
&= \mathbb{E}_{ZW}\,\rho(Z, W) - \frac{\mathbb{E}_{ZZ'}\,\rho(Z, Z')}{2} - \frac{\mathbb{E}_{WW'}\,\rho(W, W')}{2}.
\end{aligned}
\]

Proof. (Theorem 12) First, we note that $k$ is a valid reproducing kernel, since $k((x, y), (x', y')) = k_X(x, x')\,k_Y(y, y')$, where we have taken $k_X(x, x') = \rho_X(x, x_0) + \rho_X(x', x_0) - \rho_X(x, x')$ and $k_Y(y, y') = \rho_Y(y, y_0) + \rho_Y(y', y_0) - \rho_Y(y, y')$, as distance kernels induced by $\rho_X$ and $\rho_Y$, respectively. Indeed, a product of two reproducing kernels is always a valid reproducing kernel on the product space (Steinwart & Christmann, 2008, Lemma 4.6, p. 114). To show equality to the distance covariance, we start by expanding $\theta := \gamma_k^2(P_{XY}, P_X P_Y)$:

\[
\theta = \underbrace{\mathbb{E}_{XY}\mathbb{E}_{X'Y'}\, k_X(X, X')\, k_Y(Y, Y')}_{\theta_1} + \underbrace{\mathbb{E}_X\mathbb{E}_{X'}\, k_X(X, X')\; \mathbb{E}_Y\mathbb{E}_{Y'}\, k_Y(Y, Y')}_{\theta_2} - 2\,\underbrace{\mathbb{E}_{X'Y'}\!\left[\mathbb{E}_X k_X(X, X')\; \mathbb{E}_Y k_Y(Y, Y')\right]}_{\theta_3}.
\]

Note that

\[
\begin{aligned}
\theta_1 &= \mathbb{E}_{XY}\mathbb{E}_{X'Y'}\,\rho_X(X, X')\rho_Y(Y, Y') + 2\,\mathbb{E}_X\rho_X(X, x_0)\,\mathbb{E}_Y\rho_Y(Y, y_0) + 2\,\mathbb{E}_{XY}\,\rho_X(X, x_0)\rho_Y(Y, y_0) \\
&\quad - 2\,\mathbb{E}_{XY}\!\left[\rho_X(X, x_0)\,\mathbb{E}_{Y'}\rho_Y(Y, Y')\right] - 2\,\mathbb{E}_{XY}\!\left[\rho_Y(Y, y_0)\,\mathbb{E}_{X'}\rho_X(X, X')\right],
\end{aligned}
\]

\[
\begin{aligned}
\theta_2 &= \mathbb{E}_X\mathbb{E}_{X'}\rho_X(X, X')\,\mathbb{E}_Y\mathbb{E}_{Y'}\rho_Y(Y, Y') + 4\,\mathbb{E}_X\rho_X(X, x_0)\,\mathbb{E}_Y\rho_Y(Y, y_0) \\
&\quad - 2\,\mathbb{E}_X\rho_X(X, x_0)\,\mathbb{E}_Y\mathbb{E}_{Y'}\rho_Y(Y, Y') - 2\,\mathbb{E}_Y\rho_Y(Y, y_0)\,\mathbb{E}_X\mathbb{E}_{X'}\rho_X(X, X'),
\end{aligned}
\]

and

\[
\begin{aligned}
\theta_3 &= \mathbb{E}_{X'Y'}\!\left[\mathbb{E}_X\rho_X(X, X')\,\mathbb{E}_Y\rho_Y(Y, Y')\right] + 3\,\mathbb{E}_X\rho_X(X, x_0)\,\mathbb{E}_Y\rho_Y(Y, y_0) + \mathbb{E}_{XY}\,\rho_X(X, x_0)\rho_Y(Y, y_0) \\
&\quad - \mathbb{E}_{XY}\!\left[\rho_X(X, x_0)\,\mathbb{E}_{Y'}\rho_Y(Y, Y')\right] - \mathbb{E}_{XY}\!\left[\rho_Y(Y, y_0)\,\mathbb{E}_{X'}\rho_X(X, X')\right] \\
&\quad - \mathbb{E}_X\rho_X(X, x_0)\,\mathbb{E}_Y\mathbb{E}_{Y'}\rho_Y(Y, Y') - \mathbb{E}_Y\rho_Y(Y, y_0)\,\mathbb{E}_X\mathbb{E}_{X'}\rho_X(X, X').
\end{aligned}
\]

The claim now follows by inserting the resulting expansions into $\theta = \theta_1 + \theta_2 - 2\theta_3$ and cancelling the appropriate terms. Note that only the leading terms in the expansions remain.

Remark 15. It turns out that $k$ is not characteristic to $M_+^1(\mathcal{X} \times \mathcal{Y})$ — i.e., it cannot distinguish between any two distributions on $\mathcal{X} \times \mathcal{Y}$ — even if $k_X$ and $k_Y$ are characteristic. However, since $\gamma_k$ is equal to the Brownian distance covariance, we know that it can always distinguish between any $P_{XY}$ and its product of marginals $P_X P_Y$ in the Euclidean case. Namely, note that $k((x_0, y), (x_0, y')) = k((x, y_0), (x', y_0)) = 0$ for all $x, x' \in \mathcal{X}$, $y, y' \in \mathcal{Y}$. That means that for every two distinct $P_Y, Q_Y \in M_+^1(\mathcal{Y})$, one has $\gamma_k^2(\delta_{x_0} P_Y, \delta_{x_0} Q_Y) = 0$. Thus, the kernel in (9) characterizes independence but not equality of probability measures on the product space. Informally speaking, independence testing is an easier problem than homogeneity testing on the product space.

A.2. Spectral Tests

Assume that the null hypothesis holds, i.e., that $P = Q$. For a kernel $k$ and a Borel probability measure $P$, define a kernel "centred" at $P$: $\tilde{k}_P(z, z') := k(z, z') + \mathbb{E}_{WW'}\,k(W, W') - \mathbb{E}_W\,k(z, W) - \mathbb{E}_{W'}\,k(z', W')$, with $W, W' \overset{i.i.d.}{\sim} P$. Note that as a special case for $P = \delta_{z_0}$ we recover the family of kernels in (4), and that $\mathbb{E}_{ZZ'}\,\tilde{k}_P(Z, Z') = 0$, i.e., $\mu_{\tilde{k}_P}(P) = 0$. The centred kernel is important in characterizing the null distribution of the V-statistic. To the centred kernel $\tilde{k}_P$ on domain $\mathcal{Z}$, one associates the integral kernel operator $S_{\tilde{k}_P} : L^2_P(\mathcal{Z}) \to L^2_P(\mathcal{Z})$ (see Steinwart & Christmann, 2008, p. 126–127), given by

\[ S_{\tilde{k}_P}\, g(z) = \int_{\mathcal{Z}} \tilde{k}_P(z, w)\, g(w)\, dP(w). \tag{13} \]

The following theorem is a special case of Gretton et al. (2012, Theorem 12). For simplicity, we focus on the case where $m = n$.

Theorem 16. Let $\mathbf{z} = \{Z_i\}_{i=1}^m$ and $\mathbf{w} = \{W_i\}_{i=1}^m$ be two i.i.d. samples from $P \in M_+^1(\mathcal{Z})$, and let $S_{\tilde{k}_P}$ be a trace class operator. Then

\[ \frac{m}{2}\,\hat\gamma_{k,V}^2(\mathbf{z}, \mathbf{w}) \rightsquigarrow \sum_{i=1}^{\infty} \lambda_i N_i^2, \tag{14} \]

where $N_i \overset{i.i.d.}{\sim} \mathcal{N}(0, 1)$, $i \in \mathbb{N}$, and $\{\lambda_i\}_{i=1}^\infty$ are the eigenvalues of the operator $S_{\tilde{k}_P}$.

Note that this result requires that the integral kernel operator associated to the underlying probability measure $P$ is a trace class operator, i.e., that $\mathbb{E}_{Z \sim P}\,k(Z, Z) < \infty$. As before, the sufficient condition for this to hold for all probability measures is that $k$ is a bounded function. In the case of a distance kernel, this is the case if the domain $\mathcal{Z}$ has a bounded diameter with respect to the semimetric $\rho$, i.e., $\sup_{z, z' \in \mathcal{Z}} \rho(z, z') < \infty$.

The null distribution of HSIC takes an analogous form to (14) of a weighted sum of chi-squares, but with coefficients corresponding to the products of the eigenvalues of integral operators $S_{\tilde{k}_{P_X}}$ and $S_{\tilde{k}_{P_Y}}$. The following theorem is in Zhang et al. (2011, Theorem 4) and gives an asymptotic form for the null distribution of HSIC. See also Lyons (2011, Remark 2.9).

Theorem 17. Let $\mathbf{z} = \{(X_i, Y_i)\}_{i=1}^m$ be an i.i.d. sample from $P_{XY} = P_X P_Y$, with values in $\mathcal{X} \times \mathcal{Y}$. Let $S_{\tilde{k}_{P_X}} : L^2_{P_X}(\mathcal{X}) \to L^2_{P_X}(\mathcal{X})$ and $S_{\tilde{k}_{P_Y}} : L^2_{P_Y}(\mathcal{Y}) \to L^2_{P_Y}(\mathcal{Y})$ be trace class operators. Then

\[ m\,\mathrm{HSIC}(\mathbf{z}; k_X, k_Y) \rightsquigarrow \sum_{i=1}^{\infty}\sum_{j=1}^{\infty} \lambda_i \eta_j N_{i,j}^2, \tag{15} \]

where $N_{i,j} \sim \mathcal{N}(0, 1)$, $i, j \in \mathbb{N}$, are independent, and $\{\lambda_i\}_{i=1}^\infty$ and $\{\eta_j\}_{j=1}^\infty$ are the eigenvalues of the operators $S_{\tilde{k}_{P_X}}$ and $S_{\tilde{k}_{P_Y}}$, respectively.

Note that if $\mathcal{X}$ and $\mathcal{Y}$ have bounded diameters w.r.t. $\rho_X$ and $\rho_Y$, Theorem 17 applies to distance kernels induced by $\rho_X$ and $\rho_Y$ for all $P_X \in M_+^1(\mathcal{X})$, $P_Y \in M_+^1(\mathcal{Y})$.

A.3. A Characteristic Function Based Interpretation

The distance covariance in (7) was defined by Székely et al. (2007) in terms of a weighted distance between characteristic functions. We briefly review this interpretation here; however, we show that this approach cannot be used to derive a kernel-based measure of dependence (this result was first noted by Gretton et al. (2009b), and is included here in the interests of completeness). Let $X$ be a random vector on $\mathcal{X} = \mathbb{R}^p$ and $Y$ a random vector on $\mathcal{Y} = \mathbb{R}^q$. The characteristic functions of $X$ and $Y$, respectively, will be denoted by $f_X$ and $f_Y$, and their joint characteristic function by $f_{XY}$. The distance covariance $\mathcal{V}(X, Y)$ is defined via the norm of $f_{XY} - f_X f_Y$ in a weighted $L_2$ space on $\mathbb{R}^{p+q}$, i.e.,

\[ \mathcal{V}^2(X, Y) = \int \left| f_{XY}(t, s) - f_X(t) f_Y(s) \right|^2 w(t, s)\, dt\, ds, \tag{16} \]

for a particular choice of weight function given by

\[ w(t, s) = \frac{1}{c_p c_q} \cdot \frac{1}{\|t\|^{1+p}\,\|s\|^{1+q}}, \tag{17} \]

where $c_d = \pi^{(1+d)/2} / \Gamma\!\left(\frac{1+d}{2}\right)$, $d \ge 1$. An important aspect of distance covariance is that $\mathcal{V}(X, Y) = 0$ if and only if $X$ and $Y$ are independent. We next obtain a similar statistic in the kernel setting. Write $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, and let $k(z, z') = \kappa(z - z')$ be a translation invariant RKHS kernel on $\mathcal{Z}$, where $\kappa : \mathcal{Z} \to \mathbb{R}$ is a bounded continuous function. Using Bochner's theorem, $\kappa$ can be written as

\[ \kappa(z) = \int e^{-\mathrm{i} z^\top u}\, d\Lambda(u), \]

for a finite non-negative Borel measure $\Lambda$. It follows from Gretton et al. (2009b) that

\[ \gamma_k^2(P_{XY}, P_X P_Y) = \int \left| f_{XY}(t, s) - f_X(t) f_Y(s) \right|^2 d\Lambda(t, s), \]

which is in clear correspondence with (16). However, the weight function in (17) is not integrable — so one cannot find a translation invariant kernel for which $\gamma_k$ coincides with the distance covariance. By contrast, note that the kernel in (9) is not translation invariant.

A.4. Restriction on Probability Measures

In general, distance kernels and their products are continuous but unbounded, so kernel embeddings are not defined for all Borel probability measures. Thus, one needs to restrict the attention to a particular class of Borel probability measures for which kernel embeddings exist, and a sufficient condition for this is that $\mathbb{E}_{Z \sim P}\,k^{1/2}(Z, Z) < \infty$, by the Riesz representation theorem. Let $k$ be a measurable reproducing kernel on $\mathcal{Z}$, and denote, for $\theta > 0$,

\[ \mathcal{M}_k^{\theta}(\mathcal{Z}) = \left\{ \nu \in M(\mathcal{Z}) : \int k^{\theta}(z, z)\, d|\nu|(z) < \infty \right\}. \tag{18} \]

Note that the maximum mean discrepancy $\gamma_k(P, Q)$ is well defined $\forall P, Q \in \mathcal{M}_k^{1/2}(\mathcal{Z}) \cap M_+^1(\mathcal{Z})$.
Hypothesis Testing Using Pairwise Distances and Associated Kernels

Now, let $\rho$ be a semimetric of negative type. Then we can consider the class of probability measures that have a finite $\theta$-moment with respect to $\rho$:

\[ \mathcal{M}_\rho^{\theta}(\mathcal{Z}) = \left\{ \nu \in M(\mathcal{Z}) : \exists z_0 \in \mathcal{Z} \text{ s.t. } \int \rho^{\theta}(z, z_0)\, d|\nu|(z) < \infty \right\}. \tag{19} \]

To ensure existence of the energy distance $D_{E,\rho}(P, Q)$, we need to assume that $P, Q \in \mathcal{M}_\rho^1(\mathcal{Z})$, as otherwise the expectations $\mathbb{E}_{ZZ'}\rho(Z, Z')$, $\mathbb{E}_{WW'}\rho(W, W')$ and $\mathbb{E}_{ZW}\rho(Z, W)$ may be undefined. The following proposition shows that the classes of probability measures in (18) and (19) coincide at $\theta = n/2$, for $n \in \mathbb{N}$, whenever $\rho$ is generated by kernel $k$.

Proposition 18. Let $k$ be a kernel that generates the semimetric $\rho$, and let $n \in \mathbb{N}$. Then $\mathcal{M}_\rho^{n/2}(\mathcal{Z}) = \mathcal{M}_k^{n/2}(\mathcal{Z})$. In particular, if $k_1$ and $k_2$ generate the same semimetric $\rho$, then $\mathcal{M}_{k_1}^{n/2}(\mathcal{Z}) = \mathcal{M}_{k_2}^{n/2}(\mathcal{Z})$.

Proof. Let $\theta \ge \frac{1}{2}$. Note that $a^{2\theta}$ is a convex function of $a$. Suppose $\nu \in \mathcal{M}_k^{\theta}(\mathcal{Z})$. Then we have

\[
\begin{aligned}
\int \rho^{\theta}(z, z_0)\, d|\nu|(z) &= \int \|k(\cdot, z) - k(\cdot, z_0)\|_{\mathcal{H}_k}^{2\theta}\, d|\nu|(z) \\
&\le \int \left(\|k(\cdot, z)\|_{\mathcal{H}_k} + \|k(\cdot, z_0)\|_{\mathcal{H}_k}\right)^{2\theta} d|\nu|(z) \\
&\le 2^{2\theta - 1}\left( \int \|k(\cdot, z)\|_{\mathcal{H}_k}^{2\theta}\, d|\nu|(z) + \int \|k(\cdot, z_0)\|_{\mathcal{H}_k}^{2\theta}\, d|\nu|(z) \right) \\
&= 2^{2\theta - 1}\left( \int k^{\theta}(z, z)\, d|\nu|(z) + k^{\theta}(z_0, z_0)\,|\nu|(\mathcal{Z}) \right) < \infty,
\end{aligned}
\]

where we have invoked Jensen's inequality for convex functions. From the above it is clear that $\mathcal{M}_k^{\theta}(\mathcal{Z}) \subset \mathcal{M}_\rho^{\theta}(\mathcal{Z})$, for $\theta \ge 1/2$.

To prove the other direction, we show by induction that $\mathcal{M}_\rho^{\theta}(\mathcal{Z}) \subset \mathcal{M}_k^{n/2}(\mathcal{Z})$ for $\theta \ge \frac{n}{2}$, $n \in \mathbb{N}$. Let $n = 1$. Let $\theta \ge \frac{1}{2}$, and suppose that $\nu \in \mathcal{M}_\rho^{\theta}(\mathcal{Z})$. Then, by invoking the reverse triangle and Jensen's inequalities, we have

\[
\begin{aligned}
\int \rho^{\theta}(z, z_0)\, d|\nu|(z) &= \int \|k(\cdot, z) - k(\cdot, z_0)\|_{\mathcal{H}_k}^{2\theta}\, d|\nu|(z) \\
&\ge \int \left| k^{1/2}(z, z) - k^{1/2}(z_0, z_0) \right|^{2\theta} d|\nu|(z) \\
&\ge \left( \int k^{1/2}(z, z)\, d|\nu|(z) - \|\nu\|_{TV}\, k^{1/2}(z_0, z_0) \right)^{2\theta},
\end{aligned}
\]

which implies $\nu \in \mathcal{M}_k^{1/2}(\mathcal{Z})$, thereby satisfying the result for $n = 1$. Suppose the result holds for $\theta \ge \frac{n-1}{2}$, i.e., $\mathcal{M}_\rho^{\theta}(\mathcal{Z}) \subset \mathcal{M}_k^{(n-1)/2}(\mathcal{Z})$ for $\theta \ge \frac{n-1}{2}$. Let $\nu \in \mathcal{M}_\rho^{\theta}(\mathcal{Z})$ for $\theta \ge \frac{n}{2}$. Then we have

\[
\begin{aligned}
\int \rho^{\theta}(z, z_0)\, d|\nu|(z) &= \int \left( \|k(\cdot, z) - k(\cdot, z_0)\|_{\mathcal{H}_k}^{n} \right)^{\frac{2\theta}{n}} d|\nu|(z) \\
&\ge \left( \int \|k(\cdot, z) - k(\cdot, z_0)\|_{\mathcal{H}_k}^{n}\, d|\nu|(z) \right)^{\frac{2\theta}{n}} \\
&\ge \left( \int \left| \|k(\cdot, z)\|_{\mathcal{H}_k} - \|k(\cdot, z_0)\|_{\mathcal{H}_k} \right|^{n} d|\nu|(z) \right)^{\frac{2\theta}{n}} \\
&\ge \left( \int \left( \|k(\cdot, z)\|_{\mathcal{H}_k} - \|k(\cdot, z_0)\|_{\mathcal{H}_k} \right)^{n} d|\nu|(z) \right)^{\frac{2\theta}{n}} \\
&= \left( \int \sum_{r=0}^{n} (-1)^{r} \binom{n}{r} \|k(\cdot, z)\|_{\mathcal{H}_k}^{n-r}\, \|k(\cdot, z_0)\|_{\mathcal{H}_k}^{r}\, d|\nu|(z) \right)^{\frac{2\theta}{n}} \\
&= \Bigg( \underbrace{\int k^{\frac{n}{2}}(z, z)\, d|\nu|(z)}_{A} + \underbrace{\sum_{r=1}^{n} (-1)^{r} \binom{n}{r}\, k^{\frac{r}{2}}(z_0, z_0) \int k^{\frac{n-r}{2}}(z, z)\, d|\nu|(z)}_{B} \Bigg)^{\frac{2\theta}{n}}.
\end{aligned}
\]

Note that the terms in $B$ are finite, as for $\theta \ge \frac{n}{2} \ge \frac{n-1}{2} \ge \cdots \ge \frac{1}{2}$, we have $\mathcal{M}_\rho^{\theta}(\mathcal{Z}) \subset \mathcal{M}_k^{(n-1)/2}(\mathcal{Z}) \subset \cdots \subset \mathcal{M}_k^{1}(\mathcal{Z}) \subset \mathcal{M}_k^{1/2}(\mathcal{Z})$, and therefore $A$ is finite, which means $\nu \in \mathcal{M}_k^{n/2}(\mathcal{Z})$, i.e., $\mathcal{M}_\rho^{\theta}(\mathcal{Z}) \subset \mathcal{M}_k^{n/2}(\mathcal{Z})$ for $\theta \ge \frac{n}{2}$. The result shows that $\mathcal{M}_\rho^{\theta}(\mathcal{Z}) = \mathcal{M}_k^{\theta}(\mathcal{Z})$ for all $\theta \in \{\frac{n}{2} : n \in \mathbb{N}\}$.

The above Proposition gives a natural interpretation of conditions on probability measures in terms of moments w.r.t. $\rho$. Namely, the kernel embedding $\mu_k(P)$, where kernel $k$ generates the semimetric $\rho$, exists for every $P$ with finite half-moment w.r.t. $\rho$, and thus the MMD between $P$ and $Q$, $\gamma_k(P, Q)$, is well defined whenever both $P$ and $Q$ have finite half-moments w.r.t. $\rho$. If, in addition, $P$ and $Q$ have finite first moments w.r.t. $\rho$, then the $\rho$-energy distance between $P$ and $Q$ is also well defined, and it must be equal to the MMD, by Theorem 11.

Rather than imposing the condition on Borel probability measures, one may assume that the underlying semimetric space $(\mathcal{Z}, \rho)$ of negative type is itself bounded, i.e., that $\sup_{z, z' \in \mathcal{Z}} \rho(z, z') < \infty$, implying that distance kernels are bounded functions, and that both MMD and energy distance are always defined. Conversely, bounded kernels (such as the Gaussian) always induce bounded semimetrics.

Table 1. MMD with distance kernels on data from Gretton et al. (2009a). Dimensionality is: Neural I (64), Neural II (100), Health status (12,600), Subtype (2,118). An asterisk marks instances where the distance kernel had smaller Type II error in comparison to the Gaussian kernel (boldface in the original layout).

                            Gauss   dist(1/3)  dist(2/3)  dist(1)  dist(4/3)  dist(5/3)  dist(2)
  Neural I     1 - Type I   .956    .969       .964       .949     .952       .959       .959
  (m = 200)    Type II      .118    .170       .139       .119     .109*      .089*      .117*
  Neural I     1 - Type I   .950    .969       .946       .962     .947       .930       .953
  (m = 250)    Type II      .063    .075       .045*      .041*    .040*      .065       .052*
  Neural II    1 - Type I   .956    .968       .965       .963     .956       .958       .943
  (m = 200)    Type II      .292    .485       .346       .319     .297       .280*      .290*
  Neural II    1 - Type I   .963    .980       .968       .950     .952       .960       .941
  (m = 250)    Type II      .195    .323       .197       .189*    .194*      .169*      .183*
  Subtype      1 - Type I   .975    .974       .977       .971     .966       .962       .966
  (m = 10)     Type II      .055    .828       .237       .092     .042*      .033*      .024*
  Health st.   1 - Type I   .958    .980       .953       .940     .954       .955       .955
  (m = 20)     Type II      .036    .037       .039       .081     .114       .120       .165

A.5. Distance Correlation

The notion of distance covariance extends naturally to that of distance variance $\mathcal{V}^2(X) = \mathcal{V}^2(X, X)$ and that of distance correlation (in analogy to the Pearson product-moment correlation coefficient):

\[
\mathcal{R}^2(X, Y) = \begin{cases} \dfrac{\mathcal{V}^2(X, Y)}{\mathcal{V}(X)\,\mathcal{V}(Y)}, & \mathcal{V}(X)\,\mathcal{V}(Y) > 0, \\[2mm] 0, & \mathcal{V}(X)\,\mathcal{V}(Y) = 0. \end{cases}
\]

Distance correlation also has a straightforward interpretation in terms of kernels:

\[
\mathcal{R}^2(X, Y) = \frac{\mathcal{V}^2(X, Y)}{\mathcal{V}(X)\,\mathcal{V}(Y)} = \frac{\gamma_k^2(P_{XY}, P_X P_Y)}{\gamma_k(P_{XX}, P_X P_X)\,\gamma_k(P_{YY}, P_Y P_Y)} = \frac{\|\Sigma_{XY}\|_{HS}^2}{\|\Sigma_{XX}\|_{HS}\,\|\Sigma_{YY}\|_{HS}},
\]

where the covariance operator $\Sigma_{XY} : \mathcal{H}_{k_X} \to \mathcal{H}_{k_Y}$ is the linear operator for which $\langle \Sigma_{XY} f, g \rangle_{\mathcal{H}_{k_Y}} = \mathbb{E}_{XY}[f(X)g(Y)] - \mathbb{E}_X f(X)\,\mathbb{E}_Y g(Y)$ for all $f \in \mathcal{H}_{k_X}$ and $g \in \mathcal{H}_{k_Y}$, and $\|\cdot\|_{HS}$ denotes the Hilbert-Schmidt norm (Gretton et al., 2005b). It is clear that $\mathcal{R}$ is invariant to scaling $(X, Y) \mapsto (\epsilon X, \epsilon Y)$, $\epsilon > 0$, whenever the corresponding semimetrics are homogeneous, i.e., whenever $\rho_X(\epsilon x, \epsilon x') = \epsilon\,\rho_X(x, x')$, and similarly for $\rho_Y$. Moreover, $\mathcal{R}$ is invariant to translations $(X, Y) \mapsto (X + x_0, Y + y_0)$, $x_0 \in \mathcal{X}$, $y_0 \in \mathcal{Y}$, whenever $\rho_X$ and $\rho_Y$ are translation invariant.
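Under this identification, a natural plug-in estimate of $\mathcal{R}^2$ is a normalized HSIC. The following sketch (our illustration, reusing the hsic_v helper from Section 5; the claim that this ratio recovers the squared sample distance correlation when kx_fn and ky_fn are distance kernels is implied by Theorem 12, since any constant factors cancel in the ratio):

    import numpy as np

    def distance_corr2(X, Y, kx_fn, ky_fn):
        """Plug-in estimate of R^2 as normalized HSIC."""
        num = hsic_v(X, Y, kx_fn, ky_fn)
        den = np.sqrt(hsic_v(X, X, kx_fn, kx_fn) * hsic_v(Y, Y, ky_fn, ky_fn))
        return num / den if den > 0 else 0.0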

A.6. Further Experiments

We assessed the performance of two-sample tests based on distance kernels with various exponents and compared it to that of a Gaussian kernel on real-world multivariate datasets: Health status (microarray data from normal and tumor tissues), Subtype (microarray data from different subtypes of cancer), and Neural I/II (local field potential (LFP) electrode recordings from the Macaque primary visual cortex (V1) with and without spike events), all discussed in Gretton et al. (2009a). In contrast to Gretton et al. (2009a), we used smaller sample sizes, so that some Type II error persists; at higher sample sizes, all tests exhibit Type II error which is virtually zero. The results are reported in Table 1. We used the spectral test for all experiments, and the reported averages are obtained by running 1000 trials. We note that for the dataset Subtype, which is high dimensional but with only a small number of dimensions varying in mean, a larger exponent results in a test of greater power.