arXiv:1205.0411v2 [cs.LG] 21 May 2012

Hypothesis Testing Using Pairwise Distances and Associated Kernels

Dino Sejdinovic⋆,∗ [email protected]
Arthur Gretton⋆,†,∗ [email protected]
Bharath Sriperumbudur⋆ [email protected]
Kenji Fukumizu‡ [email protected]

⋆ Gatsby Computational Neuroscience Unit, CSML, University College London
† Max Planck Institute for Intelligent Systems, Tübingen
‡ The Institute of Statistical Mathematics, Tokyo
∗ These authors contributed equally. Appendix available at arxiv.org/abs/1205.0411.

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

Abstract

We provide a unifying framework linking two classes of statistics used in two-sample and independence testing: on the one hand, the energy distances and distance covariances from the statistics literature; on the other, distances between embeddings of distributions to reproducing kernel Hilbert spaces (RKHS), as established in machine learning. The equivalence holds when energy distances are computed with semimetrics of negative type, in which case a kernel may be defined such that the RKHS distance between distributions corresponds exactly to the energy distance. We determine the class of probability distributions for which kernels induced by semimetrics are characteristic (that is, for which embeddings of the distributions to an RKHS are injective). Finally, we investigate the performance of this family of kernels in two-sample and independence tests: we show in particular that the energy distance most commonly employed in statistics is just one member of a parametric family of kernels, and that other choices from this family can yield more powerful tests.

1. Introduction

The problem of testing statistical hypotheses in high dimensional spaces is particularly challenging, and has been a recent focus of considerable work in the statistics and machine learning communities. On the statistical side, two-sample testing in Euclidean spaces (of whether two independent samples are from the same distribution, or from different distributions) can be accomplished using a so-called energy distance as a statistic (Székely & Rizzo, 2004; 2005). Such tests are consistent against all alternatives as long as the random variables have finite first moments. A related dependence measure between vectors of high dimension is the distance covariance (Székely et al., 2007; Székely & Rizzo, 2009), and the resulting test is again consistent for variables with bounded first moment. The distance covariance has had a major impact in the statistics community, with Székely & Rizzo (2009) being accompanied by an editorial introduction and discussion. A particular advantage of energy distance-based statistics is their compact representation in terms of certain expectations of pairwise Euclidean distances, which leads to straightforward empirical estimates. As a follow-up work, Lyons (2011) generalized the notion of distance covariance to metric spaces of negative type (of which Euclidean spaces are a special case).

On the machine learning side, two-sample tests have been formulated based on embeddings of probability distributions into reproducing kernel Hilbert spaces (Gretton et al., 2007; 2012), using as the test statistic the difference between these embeddings: this statistic is called the maximum mean discrepancy (MMD). This distance measure was also applied to the problem of testing for independence, with the associated test statistic being the Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2005a; 2008; Smola et al., 2007; Zhang et al., 2011). Both tests are consistent against all alternatives when a characteristic RKHS is used (Fukumizu et al., 2008; Sriperumbudur et al., 2010). Such tests can further be generalized to structured and non-Euclidean domains, such as text strings, graphs or groups (Fukumizu et al., 2009).

Despite their striking similarity, the link between energy distance-based tests and kernel-based tests has been an open question. In the discussion of Székely & Rizzo (2009), Gretton et al. (2009b, p. 1289) first explored this link in the context of independence testing, and stated that interpreting the distance-based independence statistic as a kernel statistic is not straightforward, since Bochner's theorem does not apply to the choice of weight function used in the definition of Brownian distance covariance (we briefly review this argument in Section A.3 of the Appendix). Székely & Rizzo (2009, Rejoinder, p. 1303) confirmed this conclusion, and commented that RKHS-based dependence measures do not seem to be formal extensions of Brownian distance covariance because the weight function is not integrable. Our contribution resolves this question and shows that RKHS-based dependence measures are precisely the formal extensions of Brownian distance covariance, where the problem of non-integrability of weight functions is circumvented by using translation-variant kernels, i.e., distance-induced kernels, a novel family of kernels that we introduce in Section 2.2.

In the case of two-sample testing, we demonstrate that energy distances are in fact maximum mean discrepancies arising from the same family of distance-induced kernels. A number of interesting consequences arise from this insight: first, we show that the energy distance (and distance covariance) derives from a particular parameter choice from a larger family of kernels: this choice may not yield the most sensitive test. Second, results from Gretton et al. (2009a); Zhang et al. (2011) may be applied to get consistent two-sample and independence tests for the energy distance and distance covariance, without using bootstrap, which perform much better than the upper bound proposed by Székely et al. (2007) as an alternative to the bootstrap. Third, in relation to Lyons (2011), we obtain a new family of characteristic kernels arising from semimetric spaces of negative type (where the triangle inequality need not hold), which are quite unlike the characteristic kernels defined via Bochner's theorem (Sriperumbudur et al., 2010).

The structure of the paper is as follows: In Section 2, we provide the necessary definitions from RKHS theory, and the relation between RKHS and semimetrics of negative type. In Section 3.1, we review both the energy distance and the distance covariance. We relate these quantities in Sections 3.2 and 3.3 to the Maximum Mean Discrepancy (MMD) and the Hilbert-Schmidt Independence Criterion (HSIC), respectively. We give conditions for these quantities to distinguish between probability measures in Section 4, thus obtaining a new family of characteristic kernels. Empirical estimates of these quantities and associated two-sample and independence tests are described in Section 5. Finally, in Section 6, we investigate the performance of the test statistics on a variety of testing problems, which demonstrate the strengths of the new kernel family.

2. Definitions and Notation

In this section, we introduce concepts and notation required to understand reproducing kernel Hilbert spaces (Section 2.1), and distribution embeddings into RKHS. We then introduce semimetrics (Section 2.2), and review the relation of semimetrics of negative type to RKHS kernels.

2.1. RKHS Definitions

Unless stated otherwise, we will assume that Z is any topological space.

Definition 1. (RKHS) Let H be a Hilbert space of real-valued functions defined on Z. A function k : Z × Z → R is called a reproducing kernel of H if (i) ∀z ∈ Z, k(·,z) ∈ H, and (ii) ∀z ∈ Z, ∀f ∈ H, ⟨f, k(·,z)⟩_H = f(z). If H has a reproducing kernel, it is called a reproducing kernel Hilbert space (RKHS).

According to the Moore-Aronszajn theorem (Berlinet & Thomas-Agnan, 2004, p. 19), for every symmetric, positive definite function k : Z × Z → R, there is an associated RKHS H_k of real-valued functions on Z with reproducing kernel k. The map ϕ : Z → H_k, ϕ : z ↦ k(·,z) is called the canonical feature map or the Aronszajn map of k. We will say that k is a nondegenerate kernel if its Aronszajn map is injective.

2.2. Semimetrics of Negative Type

We will work with the notion of semimetric of negative type on a non-empty set Z, where the "distance" function need not satisfy the triangle inequality. Note that this notion of semimetric is different to that which arises from a seminorm, where the distance between two distinct points can be zero (also called a pseudonorm).

Definition 2. (Semimetric) Let Z be a non-empty set and let ρ : Z × Z → [0, ∞) be a function such that ∀z, z′ ∈ Z, (i) ρ(z,z′) = 0 if and only if z = z′, and (ii) ρ(z,z′) = ρ(z′,z). Then (Z,ρ) is said to be a semimetric space and ρ is called a semimetric on Z. If, in addition, (iii) ∀z, z′, z′′ ∈ Z, ρ(z′,z′′) ≤ ρ(z,z′) + ρ(z,z′′), then (Z,ρ) is said to be a metric space and ρ is called a metric on Z.

Definition 3. (Negative type) The semimetric space (Z,ρ) is said to have negative type if ∀n ≥ 2, z_1,...,z_n ∈ Z, and α_1,...,α_n ∈ R with $\sum_{i=1}^{n} \alpha_i = 0$,

$$\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j \,\rho(z_i, z_j) \le 0. \tag{1}$$
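As a quick illustration of Definition 3 (our addition, not part of the original paper; only NumPy is assumed), the following sketch evaluates the left-hand side of (1) for the squared Euclidean distance on random points, with random weights constrained to sum to zero:

```python
import numpy as np

# Numerical check of the negative type condition (1) for rho(z, z') = ||z - z'||^2.
rng = np.random.default_rng(0)
Z = rng.normal(size=(10, 3))           # n = 10 points in R^3
alpha = rng.normal(size=10)
alpha -= alpha.mean()                  # enforce sum_i alpha_i = 0

# Pairwise squared Euclidean distances rho(z_i, z_j).
D = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)

lhs = alpha @ D @ alpha                # sum_ij alpha_i alpha_j rho(z_i, z_j)
print(lhs <= 1e-12)                    # True: (R^d, ||.||^2) has negative type
```

Indeed, for the squared Euclidean distance the double sum in (1) reduces to −2‖Σᵢ αᵢ zᵢ‖² ≤ 0, consistent with Proposition 4 below.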

Note that, in the terminology of Berg et al. (1984), ρ satisfying (1) is said to be a negative definite function. The following result is a direct consequence of Berg et al. (1984, Proposition 3.2, p. 82).

Proposition 4. ρ is a semimetric of negative type if and only if there exists a Hilbert space H and an injective map ϕ : Z → H, such that

$$\rho(z,z') = \left\| \varphi(z) - \varphi(z') \right\|_{\mathcal H}^2. \tag{2}$$

This shows that (R^d, ‖· − ·‖²) is of negative type. From Berg et al. (1984, Corollary 2.10, p. 78), we have that:

Proposition 5. If ρ satisfies (1), then so does ρ^q, for 0 < q ≤ 1.

Lemma 6. Let Z be a nonempty set, and let ρ be a semimetric on Z. Let z_0 ∈ Z, and denote k(z,z′) = ρ(z,z_0) + ρ(z′,z_0) − ρ(z,z′). Then k is positive definite if and only if ρ satisfies (1).

We call the kernel k defined above the distance-induced kernel, and say that it is induced by the semimetric ρ. For brevity, we will drop "induced" hereafter, and say that k is simply the distance kernel (with some abuse of terminology). In addition, we will typically work with distance kernels scaled by 1/2. Note that k(z_0,z_0) = 0, so distance kernels are not strictly positive definite (equivalently, k(·,z_0) = 0). By varying "the point at the center" z_0, one obtains a family

$$\mathcal K_\rho = \left\{ \tfrac{1}{2}\left[\rho(z,z_0) + \rho(z',z_0) - \rho(z,z')\right] \right\}_{z_0 \in \mathcal Z}$$

of distance kernels induced by ρ. We may now express (2) from Proposition 4 in terms of the canonical feature map for the RKHS H_k (proof in Appendix A.1).

Proposition 7. Let (Z,ρ) be a semimetric space of negative type, and k ∈ K_ρ. Then:

1. k is nondegenerate, i.e., the Aronszajn map z ↦ k(·,z) is injective.
2. ρ(z,z′) = k(z,z) + k(z′,z′) − 2k(z,z′) = ‖k(·,z) − k(·,z′)‖²_{H_k}.

Note that Proposition 7 implies that the Aronszajn map z ↦ k(·,z) is an isometric embedding of the metric space (Z, ρ^{1/2}) into H_k, for every k ∈ K_ρ.

2.3. Kernels Inducing Semimetrics

We now further develop the link between semimetrics of negative type and kernels. Let k be any nondegenerate reproducing kernel on Z (for example, every strictly positive definite k is nondegenerate). Then, by Proposition 4,

$$\rho(z,z') = k(z,z) + k(z',z') - 2k(z,z') \tag{3}$$

defines a valid semimetric ρ of negative type on Z. We will say that k generates ρ. It is clear that every distance kernel k̃ ∈ K_ρ also generates ρ, and that k̃ can be expressed as:

$$\tilde k(z,z') = k(z,z') + k(z_0,z_0) - k(z,z_0) - k(z',z_0), \tag{4}$$

for some z_0 ∈ Z.

Example 8. Let Z = R^d and write ρ_q(z,z′) = ‖z − z′‖^q. By Proposition 5, ρ_q is a valid semimetric of negative type for 0 < q ≤ 2. The distance kernel induced by ρ_q and centered at z_0 = 0 is

$$k_q(z,z') = \tfrac{1}{2}\left( \|z\|^q + \|z'\|^q - \|z - z'\|^q \right). \tag{5}$$

Example 9. Let Z = R^d, and consider the Gaussian kernel k(z,z′) = e^{−σ‖z−z′‖²}. The induced semimetric is ρ(z,z′) = 2[1 − e^{−σ‖z−z′‖²}]. There are many other kernels that generate ρ, however; for example, the distance kernel induced by ρ and "centered at zero" is k̃(z,z′) = e^{−σ‖z−z′‖²} + 1 − e^{−σ‖z‖²} − e^{−σ‖z′‖²}.
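To make the construction concrete, here is a sketch (ours, not from the paper; NumPy assumed, and the helper name is hypothetical) that builds the distance kernel of Example 8 from ρ_q via Lemma 6 and confirms positive definiteness numerically. Note that q = 2 recovers the linear kernel ⟨z, z′⟩.

```python
import numpy as np

def distance_kernel(Z, z0, rho):
    """k(z, z') = 0.5 * [rho(z, z0) + rho(z', z0) - rho(z, z')], as in Lemma 6 (scaled by 1/2)."""
    n = len(Z)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = 0.5 * (rho(Z[i], z0) + rho(Z[j], z0) - rho(Z[i], Z[j]))
    return K

q = 1.0                                             # q = 1 gives the energy-distance kernel
rho_q = lambda a, b: np.linalg.norm(a - b) ** q     # negative type for 0 < q <= 2 (Example 8)

rng = np.random.default_rng(1)
Z = rng.normal(size=(50, 2))
K = distance_kernel(Z, z0=np.zeros(2), rho=rho_q)   # centered at z0 = 0
print(np.linalg.eigvalsh(K).min() >= -1e-8)         # True: k_q is positive definite
```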

3. Distances and Covariances

In this section, we begin with a description of the energy distance, which measures distance between distributions, and the distance covariance, which measures dependence. We then demonstrate that the former is a special instance of the maximum mean discrepancy (a kernel measure of distance between distributions), and the latter an instance of the Hilbert-Schmidt Independence Criterion (a kernel dependence measure). We will denote by M(Z) the set of all finite signed Borel measures on Z, and by M_+^1(Z) the set of all Borel probability measures on Z.

3.1. Energy Distance and Distance Covariance

Székely & Rizzo (2004; 2005) use the following measure of statistical distance between two probability measures P and Q on R^d, termed the energy distance:

$$D_E(P,Q) = 2\,\mathbb E_{ZW}\|Z - W\| - \mathbb E_{ZZ'}\|Z - Z'\| - \mathbb E_{WW'}\|W - W'\|, \tag{6}$$

where Z, Z′ i.i.d.∼ P and W, W′ i.i.d.∼ Q. This quantity characterizes the equality of distributions, and in the scalar case, it coincides with twice the Cramér-von Mises distance. We may generalize it to a semimetric space of negative type (Z,ρ), with the expression for this generalized energy distance D_{E,ρ}(P,Q) being of the same form as (6), with the Euclidean distance replaced by ρ. Note that the negative type of ρ implies the non-negativity of D_{E,ρ}. In Section 3.2, we will show that for every ρ, D_{E,ρ} is precisely the MMD associated to a particular kernel k on Z.

Now, let X be a random vector on R^p and Y a random vector on R^q. The distance covariance was introduced in Székely et al. (2007); Székely & Rizzo (2009) to address the problem of testing and measuring dependence between X and Y, in terms of a weighted L₂-distance between characteristic functions of the joint distribution of X and Y and the product of their marginals. Given a particular choice of weight function, it can be computed in terms of certain expectations of pairwise Euclidean distances,

$$\mathcal V^2(X,Y) = \mathbb E_{XY}\mathbb E_{X'Y'}\|X - X'\|\|Y - Y'\| + \mathbb E_X\mathbb E_{X'}\|X - X'\|\;\mathbb E_Y\mathbb E_{Y'}\|Y - Y'\| - 2\,\mathbb E_{X'Y'}\!\left[\mathbb E_X\|X - X'\|\;\mathbb E_Y\|Y - Y'\|\right], \tag{7}$$

where (X,Y) and (X′,Y′) are i.i.d.∼ P_XY. Recently, Lyons (2011) established that the generalization of the distance covariance is possible to metric spaces of negative type, with the expression for this generalized distance covariance V²_{ρ_X,ρ_Y}(X,Y) being of the same form as (7), with Euclidean distances replaced by metrics of negative type ρ_X and ρ_Y on domains X and Y, respectively. In Section 3.3, we will show that the generalized distance covariance of a pair of random variables X and Y is precisely HSIC associated to a particular kernel k on the product of domains of X and Y.

3.2. Maximum Mean Discrepancy

The notion of the feature map in an RKHS (Section 2.1) can be extended to kernel embeddings of probability measures (Berlinet & Thomas-Agnan, 2004; Sriperumbudur et al., 2010).

Definition 10. (Kernel embedding) Let k be a kernel on Z, and P ∈ M_+^1(Z). The kernel embedding of P into the RKHS H_k is µ_k(P) ∈ H_k such that E_{Z∼P} f(Z) = ⟨f, µ_k(P)⟩_{H_k} for all f ∈ H_k.

Alternatively, the kernel embedding can be defined by the Bochner expectation µ_k(P) = E_{Z∼P} k(·,Z). By the Riesz representation theorem, a sufficient condition for the existence of µ_k(P) is that k is Borel-measurable and that E_{Z∼P} k^{1/2}(Z,Z) < ∞. If k is a bounded continuous function, this is obviously true for all P ∈ M_+^1(Z). Kernel embeddings can be used to induce metrics on the spaces of probability measures, giving the maximum mean discrepancy (MMD),

$$\gamma_k^2(P,Q) = \left\|\mu_k(P) - \mu_k(Q)\right\|_{\mathcal H_k}^2 = \mathbb E_{ZZ'}\,k(Z,Z') + \mathbb E_{WW'}\,k(W,W') - 2\,\mathbb E_{ZW}\,k(Z,W), \tag{8}$$

where Z, Z′ i.i.d.∼ P and W, W′ i.i.d.∼ Q. If the restriction of µ_k to some P(Z) ⊆ M_+^1(Z) is well defined and injective, then k is said to be characteristic to P(Z), and it is said to be characteristic (without further qualification) if it is characteristic to M_+^1(Z). When k is characteristic, γ_k is a metric on M_+^1(Z), i.e., γ_k(P,Q) = 0 iff P = Q, ∀P,Q ∈ M_+^1(Z). Conditions under which kernels are characteristic have been studied by Sriperumbudur et al. (2008); Fukumizu et al. (2009); Sriperumbudur et al. (2010). An alternative interpretation of (8) is as an integral probability metric (Müller, 1997): see Gretton et al. (2012) for details.

In general, distance kernels are continuous but unbounded functions. Thus, kernel embeddings are not defined for all Borel probability measures, and one needs to restrict the attention to a class of Borel probability measures for which E_{Z∼P} k^{1/2}(Z,Z) < ∞ when discussing the maximum mean discrepancy. We will assume that all Borel probability measures considered satisfy the stronger condition E_{Z∼P} k(Z,Z) < ∞ (this reflects a finite first moment condition on random variables considered in distance covariance tests, and will imply that all quantities appearing in our results are well defined). For more details, see Section A.4 in the Appendix. As an alternative to requiring this condition, one may assume that the underlying semimetric space (Z,ρ) of negative type is itself bounded, i.e., that sup_{z,z′∈Z} ρ(z,z′) < ∞.

We are now able to describe the relation between the maximum mean discrepancy and the energy distance. The following theorem is a consequence of Lemma 6, and is proved in Section A.1 of the Appendix.

Theorem 11. Let (Z,ρ) be a semimetric space of negative type and let z_0 ∈ Z. The distance kernel k induced by ρ satisfies γ_k²(P,Q) = ½ D_{E,ρ}(P,Q). In particular, γ_k does not depend on the choice of z_0.
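The equality in Theorem 11 is easy to verify empirically. The sketch below (our illustration, not from the paper; NumPy/SciPy assumed) compares the plug-in estimates of γ_k² and D_{E,ρ} on Gaussian samples, for the Euclidean ρ:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(2)
Z = rng.normal(0.0, 1.0, size=(300, 2))            # sample from P
W = rng.normal(0.5, 1.0, size=(300, 2))            # sample from Q

def energy_distance(Z, W):
    # Plug-in estimate of (6): 2 E||Z - W|| - E||Z - Z'|| - E||W - W'||.
    return 2 * cdist(Z, W).mean() - cdist(Z, Z).mean() - cdist(W, W).mean()

def mmd2_distance_kernel(Z, W, z0):
    # Plug-in estimate of (8) with k(z, z') = 0.5*(||z - z0|| + ||z' - z0|| - ||z - z'||).
    k = lambda A, B: 0.5 * (cdist(A, z0[None]) + cdist(B, z0[None]).T - cdist(A, B))
    return k(Z, Z).mean() + k(W, W).mean() - 2 * k(Z, W).mean()

print(np.isclose(mmd2_distance_kernel(Z, W, np.zeros(2)),
                 0.5 * energy_distance(Z, W)))      # True
```

The z_0-dependent terms cancel exactly in the sum, so the result is the same for any choice of center, as the theorem states.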

There is a subtlety to the link between kernels and semimetrics, when used in computing the distance between probabilities. Consider again the family of distance kernels K_ρ, where the semimetric ρ is itself generated from k according to (3). As we have seen, it may be that k ∉ K_ρ; however, it is clear that γ_k²(P,Q) = ½ D_{E,ρ}(P,Q) whenever k generates ρ. Thus, all kernels that generate the same semimetric ρ on Z give rise to the same metric γ_k on (possibly a subset of) M_+^1(Z), and γ_k is merely an extension of the metric ρ^{1/2} on the point masses. The kernel-based and distance-based methods are therefore equivalent, provided that we allow "distances" ρ which may not satisfy the triangle inequality.

3.3. The Hilbert-Schmidt Independence Criterion

Given a pair of jointly observed random variables (X,Y) with values in X × Y, the Hilbert-Schmidt Independence Criterion (HSIC) is computed as the maximum mean discrepancy between the joint distribution P_XY and the product of its marginals P_X P_Y. Let k_X and k_Y be kernels on X and Y, with respective RKHSs H_{k_X} and H_{k_Y}. Following Smola et al. (2007, Section 2.3), we consider the MMD associated to the kernel k((x,y),(x′,y′)) = k_X(x,x′) k_Y(y,y′) on X × Y, with RKHS H_k isometrically isomorphic to the tensor product H_{k_X} ⊗ H_{k_Y}. It follows that θ := γ_k²(P_XY, P_X P_Y) satisfies

$$\begin{aligned}
\theta &= \left\| \mathbb E_{XY}\left[k_X(\cdot,X) \otimes k_Y(\cdot,Y)\right] - \mathbb E_X k_X(\cdot,X) \otimes \mathbb E_Y k_Y(\cdot,Y) \right\|^2_{\mathcal H_{k_X} \otimes \mathcal H_{k_Y}} \\
&= \mathbb E_{XY}\mathbb E_{X'Y'}\, k_X(X,X')\, k_Y(Y,Y') + \mathbb E_X\mathbb E_{X'}\, k_X(X,X')\;\mathbb E_Y\mathbb E_{Y'}\, k_Y(Y,Y') - 2\,\mathbb E_{X'Y'}\!\left[\mathbb E_X\, k_X(X,X')\;\mathbb E_Y\, k_Y(Y,Y')\right],
\end{aligned}$$

where in the last step we used that ⟨f ⊗ g, f′ ⊗ g′⟩_{H_{k_X} ⊗ H_{k_Y}} = ⟨f, f′⟩_{H_{k_X}} ⟨g, g′⟩_{H_{k_Y}}. It can be shown that this quantity is the squared Hilbert-Schmidt norm of the covariance operator between RKHSs (Gretton et al., 2005b). The following theorem demonstrates the link between HSIC and the distance covariance, and is proved in Appendix A.1.

Theorem 12. Let (X, ρ_X) and (Y, ρ_Y) be semimetric spaces of negative type, and (x_0, y_0) ∈ X × Y. Define

$$k\left((x,y),(x',y')\right) := \left[\rho_X(x,x_0) + \rho_X(x',x_0) - \rho_X(x,x')\right] \times \left[\rho_Y(y,y_0) + \rho_Y(y',y_0) - \rho_Y(y,y')\right]. \tag{9}$$

Then, k is a positive definite kernel on X × Y, and γ_k²(P_XY, P_X P_Y) = V²_{ρ_X,ρ_Y}(X,Y).

We remark that a similar result to Theorem 12 is given by Lyons (2011, Proposition 3.16), but without making use of the RKHS equivalence. Theorem 12 is a more general statement, in the sense that we allow ρ to be a semimetric of negative type, rather than a metric. In addition to yielding a more general statement, the RKHS equivalence leads to a significantly simpler proof: the result is an immediate application of the HSIC expansion of Smola et al. (2007).
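Theorem 12 can likewise be checked numerically. In the sketch below (ours, not from the paper; NumPy/SciPy assumed), double centering of the pairwise distance matrices removes the (x_0, y_0)-dependent terms of the kernel in (9):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(6)
X = rng.normal(size=(150, 2))
Y = X ** 2 + 0.2 * rng.normal(size=(150, 2))       # X and Y dependent
Dx, Dy = squareform(pdist(X)), squareform(pdist(Y))
m = len(X)

# Distance covariance V-statistic, the sample version of (7):
v2 = ((Dx * Dy).mean() + Dx.mean() * Dy.mean()
      - 2 * (Dx.mean(axis=1) * Dy.mean(axis=1)).mean())

# gamma_k^2 with the product distance kernel of (9): centering removes the
# (x0, y0)-dependent terms, leaving products of doubly centred distances.
H = np.eye(m) - np.ones((m, m)) / m
gamma2 = ((H @ Dx @ H) * (H @ Dy @ H)).mean()
print(np.isclose(v2, gamma2))                      # True, illustrating Theorem 12
```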
4. Distinguishing Probability Distributions

Lyons (2011, Theorem 3.20) shows that distance covariance in a metric space characterizes independence if the metrics satisfy an additional property, termed strong negative type. We will extend this notion to a semimetric ρ. We will say that P ∈ M_+^1(Z) has a finite first moment w.r.t. ρ if ∫ ρ(z,z_0) dP is finite for some z_0 ∈ Z. It is easy to see that the integral ∫ ρ d([P − Q] × [P − Q]) = −D_{E,ρ}(P,Q) converges whenever P and Q have finite first moments w.r.t. ρ. In Appendix A.4, we show that this condition is equivalent to E_{Z∼P} k(Z,Z) < ∞, for a kernel k that generates ρ, which implies the kernel embedding µ_k(P) is also well defined.

Definition 13. The semimetric space (Z,ρ) is said to have a strong negative type if ∀P,Q ∈ M_+^1(Z) with finite first moment w.r.t. ρ,

$$P \ne Q \;\Rightarrow\; \int \rho \, d\left([P - Q] \times [P - Q]\right) < 0. \tag{10}$$

The quantity in (10) is exactly −2γ_k²(P,Q) for all P,Q with finite first moment w.r.t. ρ. We directly obtain:

Proposition 14. Let kernel k generate ρ. Then (Z,ρ) has a strong negative type if and only if k is characteristic to all probability measures with finite first moment w.r.t. ρ.

Thus, the problems of checking whether a semimetric is of strong negative type and whether its associated kernel is characteristic to an appropriate space of Borel probability measures are equivalent. This conclusion has some overlap with Lyons (2011): in particular, Proposition 14 is stated in Lyons (2011, Proposition 3.10), where the barycenter map β is a kernel embedding in our terminology, although Lyons does not consider distribution embeddings in an RKHS.

5. Empirical Estimates and Hypothesis Tests

In the case of two-sample testing, we are given i.i.d. samples z = {z_i}_{i=1}^m ∼ P and w = {w_i}_{i=1}^n ∼ Q. The empirical (biased) V-statistic estimate of (8) is

$$\hat\gamma^2_{k,V}(\mathbf z, \mathbf w) = \frac{1}{m^2}\sum_{i=1}^{m}\sum_{j=1}^{m} k(z_i,z_j) + \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} k(w_i,w_j) - \frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} k(z_i,w_j). \tag{11}$$

Recall that if we use a distance kernel k induced by a semimetric ρ, this estimate involves only the pairwise ρ-distances between the sample points.

In the case of independence testing, we are given i.i.d. samples z = {(x_i,y_i)}_{i=1}^m ∼ P_XY, and the resulting V-statistic estimate (HSIC) is (Gretton et al., 2005a; 2008)

$$\mathrm{HSIC}(\mathbf z; k_X, k_Y) = \frac{1}{m^2}\,\mathrm{Tr}(K_X H K_Y H), \tag{12}$$

where K_X, K_Y and H are m × m matrices given by (K_X)_{ij} := k_X(x_i,x_j), (K_Y)_{ij} := k_Y(y_i,y_j) and H_{ij} = δ_{ij} − 1/m (the centering matrix). As in the two-sample case, if both k_X and k_Y are distance kernels, the test statistic involves only the pairwise distances between the samples, i.e., the kernel matrices in (12) may be replaced by distance matrices.
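In code, both statistics take a few lines given precomputed Gram matrices; the sketch below is our illustration (NumPy assumed):

```python
import numpy as np

def mmd2_v(Kzz, Kww, Kzw):
    """Biased two-sample V-statistic (11) from Gram matrices."""
    return Kzz.mean() + Kww.mean() - 2.0 * Kzw.mean()

def hsic_v(Kx, Ky):
    """Biased HSIC V-statistic (12): Tr(Kx H Ky H) / m^2."""
    m = Kx.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(Kx @ H @ Ky @ H) / m**2
```

With distance kernels, the matrices passed to these functions are assembled from pairwise ρ-distances alone, as noted above.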
We would like to design distance-based tests with an asymptotic Type I error of α, and thus we require an estimate of the (1 − α)-quantile of the V-statistic distribution under the null hypothesis. Under the null hypothesis, both (11) and (12) converge to a particular weighted sum of chi-squared distributed independent random variables (for more details, see Section A.2). We investigate two approaches, both of which yield consistent tests: a bootstrap approach (Arcones & Giné, 1992), and a spectral approach (Gretton et al., 2009a; Zhang et al., 2011). The latter requires empirical computation of the spectrum of kernel integral operators, a problem studied extensively in the context of kernel PCA (Schölkopf et al., 1997). In the two-sample case, one computes the eigenvalues of the centred Gram matrix K̃ = HKH on the aggregated samples. Here, K is a 2m × 2m matrix with entries K_{ij} = k(u_i, u_j), where u = [z w] is the concatenation of the two samples and H is the centering matrix. Gretton et al. (2009a) show that the null distribution defined using these finite sample estimates converges to the population distribution, provided that the spectrum is square-root summable. The same approach can be used for a consistent finite sample null distribution of HSIC, via computation of the eigenvalues of K̃_X = HK_X H and K̃_Y = HK_Y H (Zhang et al., 2011).

Both Székely & Rizzo (2004, p. 14) and Székely et al. (2007, p. 2782–2783) establish that the energy distance and distance covariance statistics, respectively, converge to a particular weighted sum of chi-squares of form similar to that found for the kernel-based statistics. Analogous results for the generalized distance covariance are presented by Lyons (2011, p. 7–8). These works do not propose test designs that attempt to estimate the coefficients in such representations of the null distribution, however (note also that these coefficients have a more intuitive interpretation using kernels). Besides the bootstrap, Székely et al. (2007, Theorem 6) also propose an independence test using a bound applicable to a general quadratic form Q of centered Gaussian random variables with E[Q] = 1: P{Q ≥ (Φ^{−1}(1 − α/2))²} ≤ α, valid for 0 < α ≤ 0.215. When applied to the distance covariance statistic, the upper bound of α is achieved if X and Y are independent Bernoulli variables. The authors remark that the resulting criterion might be over-conservative. Thus, more sensitive tests are possible by computing the spectrum of the centred Gram matrices associated to distance kernels, and we pursue this approach in the next section.
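A minimal sketch of the spectral approach follows (ours, not from the paper; NumPy assumed, and the eigenvalue scaling is indicative only, following the construction above): the weighted sum of chi-squares is simulated using the spectrum of the centred Gram matrix, and the null hypothesis is rejected when the suitably scaled statistic of (11) exceeds the resulting quantile.

```python
import numpy as np

def spectral_null_quantile(K, alpha=0.05, n_sim=5000, seed=0):
    """Estimate the (1 - alpha)-quantile of the weighted chi-square null
    from the spectrum of the centred Gram matrix K on the aggregated sample."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]                           # n = 2m aggregated points
    H = np.eye(n) - np.ones((n, n)) / n
    lam = np.linalg.eigvalsh(H @ K @ H) / n  # empirical eigenvalues of the integral operator
    lam = lam[lam > 1e-12]
    sims = rng.chisquare(1, size=(n_sim, lam.size)) @ lam
    return np.quantile(sims, 1.0 - alpha)    # compare the scaled V-statistic to this threshold
```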

6. Experiments

6.1. Two-sample Experiments

In the two-sample experiments, we investigate three different kinds of synthetic data. In the first, we compare two multivariate Gaussians, where the means differ in one dimension only, and all variances are equal. In the second, we again compare two multivariate Gaussians, but this time with identical means in all dimensions, and variance that differs in a single dimension. In our third experiment, we use the benchmark data of Sriperumbudur et al. (2009): one distribution is a univariate Gaussian, and the second is a univariate Gaussian with a sinusoidal perturbation of increasing frequency (where higher frequencies correspond to harder problems). All tests use a distance kernel induced by the Euclidean distance. As shown on the left plots in Figure 1, the spectral and bootstrap test designs appear indistinguishable, and they significantly outperform the test designed using the quadratic form bound, which appears to be far too conservative for the data sets considered. This is confirmed by checking the Type I error of the quadratic form test, which is significantly smaller than the test size of α = 0.05. We also compare the performance to that of the Gaussian kernel, with the bandwidth set to the median distance between points in the aggregation of samples. We see that when the means differ, both tests perform similarly. When the variances differ, it is clear that the Gaussian kernel has a major advantage over the distance kernel, although this advantage decreases with increasing dimension (where both perform poorly). In the case of a sinusoidal perturbation, the performance is again very similar.

[Figure 1 (plots omitted): three experiment rows ("Means differ", "Variances differ", "Sinusoidal perturbation", all at α = 0.05), plotting Type II error against log₂(dimension) or sinusoid frequency; the left column compares spectral, bootstrap and quadratic-form designs for distance and Gaussian kernels, and the right column compares spectral tests with exponents q ∈ {1/6, 1/3, 2/3, 1, 4/3, 5/3, 2} and the Gaussian (median) kernel.]
Figure 1. (left) MMD using Gaussian and distance kernels for various tests; (right) Spectral MMD using distance kernels with various exponents. The number of samples in all experiments was set to m = 200.

In addition, following Example 8, we investigate the performance of kernels obtained using the semimetric ρ(z,z′) = ‖z − z′‖^q for 0 < q ≤ 2. Results are presented in the right hand plots of Figure 1. While judiciously chosen values of q offer some improvement in the cases of differing mean and variance, we see a dramatic improvement for the sinusoidal perturbation, compared with the case q = 1 and the Gaussian kernel: values q = 1/3 (and smaller) yield virtually error-free performance even at high frequencies (note that q = 1 corresponds to the energy distance described in Székely & Rizzo (2004; 2005)). Additional experiments with real-world data are presented in Appendix A.6.

We observe from the simulation results that distance kernels with higher exponents are advantageous in cases where distributions differ in mean value along a single dimension (with noise in the remainder), whereas distance kernels with smaller exponents are more sensitive to differences in distributions at finer lengthscales (i.e., where the characteristic functions of the distributions differ at higher frequencies). This observation also appears to hold true on the real-world data experiments in Appendix A.6.

6.2. Independence Experiments

To assess independence tests, we used an artificial benchmark proposed by Gretton et al. (2008): we generate univariate random variables from the ICA benchmark densities of Bach & Jordan (2002); rotate them in the product space by an angle between 0 and π/4 to introduce dependence; fill additional dimensions with independent Gaussian noise; and, finally, pass the resulting multivariate data through random and independent orthogonal transformations. The resulting random variables X and Y are dependent but uncorrelated. The case m = 1024 (sample size) and d = 4 (dimension) is plotted in Figure 2 (left). As observed by Gretton et al. (2009b), the Gaussian kernel does better than the distance kernel with q = 1. By varying q, however, we are able to obtain a wide range of performance; in particular, the values q = 1/6 (and smaller) have an advantage over the Gaussian kernel on this dataset, especially in the case of smaller angles of rotation. As for the two-sample case, bootstrap and spectral tests have indistinguishable performance, and are significantly more sensitive than the quadratic form based test, which failed to reject the null hypothesis of independence on this dataset.

[Figure 2 (plots omitted): Type II error of HSIC against the angle of rotation (× π/4) for m = 1024, d = 4 (left), and against the frequency for m = 512 (right), for distance kernels with exponents q ∈ {1/6, 1/3, 2/3, 1, 4/3} and the Gaussian (median) kernel.]
Figure 2. HSIC using distance kernels with various exponents and a Gaussian kernel as a function of (left) the angle of rotation for the dependence induced by rotation; (right) frequency ℓ in the sinusoidal dependence example.

In addition, we assess the test performance on sinusoidally dependent data. The distribution over the pair X, Y was drawn from P_XY ∝ 1 + sin(ℓx) sin(ℓy) for integer ℓ, on the support X × Y, where X := [−π, π] and Y := [−π, π]. In this way, increasing ℓ caused the departure from a uniform (independent) distribution to occur at increasing frequencies, making this departure harder to detect from a small sample size. Results are in Figure 2 (right).
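For concreteness, the sinusoidally dependent pairs can be drawn by rejection sampling, as in this sketch (ours, not from the paper; NumPy assumed):

```python
import numpy as np

def sample_sinusoidal(m, ell, rng):
    """Rejection sampler for p(x, y) proportional to 1 + sin(ell*x)*sin(ell*y) on [-pi, pi]^2."""
    out = []
    while len(out) < m:
        x, y = rng.uniform(-np.pi, np.pi, size=2)
        u = rng.uniform(0.0, 2.0)            # unnormalized density is bounded above by 2
        if u <= 1.0 + np.sin(ell * x) * np.sin(ell * y):
            out.append((x, y))
    return np.array(out)

XY = sample_sinusoidal(512, ell=2, rng=np.random.default_rng(4))
X, Y = XY[:, :1], XY[:, 1:]                  # marginals are uniform, but X and Y are dependent
```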

We note that the distance covariance outperforms the Gaussian kernel on this example, and that smaller exponents result in better performance (lower Type II error when the departure from independence occurs at higher frequencies). Finally, we note that the setting q = 1, which is described in Székely et al. (2007); Székely & Rizzo (2009), is a reasonable heuristic in practice, but does not yield the most powerful tests on either dataset.

7. Conclusion

We have established an equivalence between the energy distance and distance covariance, and RKHS measures of distance between distributions. In particular, energy distances and RKHS distance measures coincide when the kernel is induced by a semimetric of negative type. The associated family of kernels performs well in two-sample and independence testing: interestingly, the parameter choice most commonly used in the statistics literature does not yield the most powerful tests in many settings.

The interpretation of the energy distance and distance covariance in an RKHS setting should be of considerable interest both to statisticians and machine learning researchers, since the associated kernels may be used much more widely: in conditional dependence testing and estimates of the chi-squared distance (Fukumizu et al., 2008), in Bayesian inference (Fukumizu et al., 2011), in mixture density estimation (Sriperumbudur, 2011), and in other applications. In particular, the link with kernels makes these applications of the energy distance immediate and straightforward. Finally, for problem settings defined most naturally in terms of distances, where these distances are of negative type, there is an interpretation in terms of reproducing kernels, and the learning machinery from the kernel literature can be brought to bear.

References

Arcones, M. and Giné, E. On the bootstrap of U and V statistics. Ann. Stat., 20(2):655–674, 1992.

Bach, F. R. and Jordan, M. I. Kernel independent component analysis. J. Mach. Learn. Res., 3:1–48, 2002.

Berg, C., Christensen, J. P. R., and Ressel, P. Harmonic Analysis on Semigroups. Springer, New York, 1984.

Berlinet, A. and Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer, 2004.

Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. Kernel measures of conditional dependence. In NIPS, 2008.

Fukumizu, K., Sriperumbudur, B., Gretton, A., and Schölkopf, B. Characteristic kernels on groups and semigroups. In NIPS, 2009.

Fukumizu, K., Song, L., and Gretton, A. Kernel Bayes' rule. In NIPS, 2011.

Gretton, A., Bousquet, O., Smola, A. J., and Schölkopf, B. Measuring statistical dependence with Hilbert-Schmidt norms. In ALT, 2005a.

Gretton, A., Herbrich, R., Smola, A., Bousquet, O., and Schölkopf, B. Kernel methods for measuring independence. J. Mach. Learn. Res., 6:2075–2129, 2005b.

Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., and Smola, A. A kernel method for the two-sample problem. In NIPS, 2007.

Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., and Smola, A. A kernel statistical test of independence. In NIPS, 2008.

Gretton, A., Fukumizu, K., Harchaoui, Z., and Sriperumbudur, B. A fast, consistent kernel two-sample test. In NIPS, 2009a.

Gretton, A., Fukumizu, K., and Sriperumbudur, B. Discussion of: Brownian distance covariance. Ann. Appl. Stat., 3(4):1285–1294, 2009b.

Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., and Smola, A. A kernel two-sample test. J. Mach. Learn. Res., 13:723–773, 2012.

Lyons, R. Distance covariance in metric spaces. arXiv:1106.5758, June 2011.

Müller, A. Integral probability metrics and their generating classes of functions. Adv. Appl. Probab., 29(2):429–443, 1997.

Schölkopf, B., Smola, A. J., and Müller, K.-R. Kernel principal component analysis. In ICANN, 1997.

Smola, A. J., Gretton, A., Song, L., and Schölkopf, B. A Hilbert space embedding for distributions. In ALT, 2007.

Sriperumbudur, B. Mixture density estimation via Hilbert space embedding of measures. In IEEE ISIT, 2011.

Sriperumbudur, B., Gretton, A., Fukumizu, K., Lanckriet, G., and Schölkopf, B. Injective Hilbert space embeddings of probability measures. In COLT, 2008.

Sriperumbudur, B., Fukumizu, K., Gretton, A., Lanckriet, G., and Schölkopf, B. Kernel choice and classifiability for RKHS embeddings of probability distributions. In NIPS, 2009.

Sriperumbudur, B., Gretton, A., Fukumizu, K., Lanckriet, G., and Schölkopf, B. Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res., 11:1517–1561, 2010.

Steinwart, I. and Christmann, A. Support Vector Machines. Springer, 2008.

Székely, G. and Rizzo, M. Testing for equal distributions in high dimension. InterStat, (5), November 2004.

Székely, G. and Rizzo, M. A new test for multivariate normality. J. Multivariate Anal., 93:58–80, 2005.

Székely, G. and Rizzo, M. Brownian distance covariance. Ann. Appl. Stat., 3(4):1233–1303, 2009.

Székely, G., Rizzo, M., and Bakirov, N. K. Measuring and testing dependence by correlation of distances. Ann. Stat., 35(6):2769–2794, 2007.

Zhang, K., Peters, J., Janzing, D., and Schölkopf, B. Kernel-based conditional independence test and application in causal discovery. In UAI, 2011.

A. Appendix

A.1. Proofs

Proof. (Proposition 7) If z, z′ ∈ Z are such that k(w,z) = k(w,z′) for all w ∈ Z, one would also have ρ(z,z_0) − ρ(z,w) = ρ(z′,z_0) − ρ(z′,w) for all w ∈ Z. In particular, by inserting w = z and w = z′, we obtain ρ(z,z′) = −ρ(z,z′) = 0, i.e., z = z′. The second statement follows readily by expressing k in terms of ρ.

Proof. (Theorem 11) The result follows directly by inserting the distance kernel from Lemma 6 into (8), and cancelling out the terms dependent on a single random variable. Define θ := γ_k²(P,Q). Then

$$\begin{aligned}
\theta &= \tfrac12\,\mathbb E_{ZZ'}\!\left[\rho(Z,z_0) + \rho(Z',z_0) - \rho(Z,Z')\right] + \tfrac12\,\mathbb E_{WW'}\!\left[\rho(W,z_0) + \rho(W',z_0) - \rho(W,W')\right] - \mathbb E_{ZW}\!\left[\rho(Z,z_0) + \rho(W,z_0) - \rho(Z,W)\right] \\
&= \mathbb E_{ZW}\,\rho(Z,W) - \tfrac12\,\mathbb E_{ZZ'}\,\rho(Z,Z') - \tfrac12\,\mathbb E_{WW'}\,\rho(W,W').
\end{aligned}$$

Proof. (Theorem 12) First, we note that k is a valid reproducing kernel, since k((x,y),(x′,y′)) = k_X(x,x′) k_Y(y,y′), where we have taken k_X(x,x′) = ρ_X(x,x_0) + ρ_X(x′,x_0) − ρ_X(x,x′) and k_Y(y,y′) = ρ_Y(y,y_0) + ρ_Y(y′,y_0) − ρ_Y(y,y′), as distance kernels induced by ρ_X and ρ_Y, respectively. Indeed, a product of two reproducing kernels is always a valid reproducing kernel on the product space (Steinwart & Christmann, 2008, Lemma 4.6, p. 114). To show equality to the distance covariance, we start by expanding θ := γ_k²(P_XY, P_X P_Y):

$$\theta = \underbrace{\mathbb E_{XY}\mathbb E_{X'Y'}\, k_X(X,X')\, k_Y(Y,Y')}_{\theta_1} + \underbrace{\mathbb E_X\mathbb E_{X'}\, k_X(X,X')\;\mathbb E_Y\mathbb E_{Y'}\, k_Y(Y,Y')}_{\theta_2} - 2\,\underbrace{\mathbb E_{X'Y'}\!\left[\mathbb E_X\, k_X(X,X')\;\mathbb E_Y\, k_Y(Y,Y')\right]}_{\theta_3}.$$

Note that

$$\begin{aligned}
\theta_1 &= \mathbb E_{XY}\mathbb E_{X'Y'}\,\rho_X(X,X')\rho_Y(Y,Y') + 2\,\mathbb E_X\rho_X(X,x_0)\,\mathbb E_Y\rho_Y(Y,y_0) + 2\,\mathbb E_{XY}\,\rho_X(X,x_0)\rho_Y(Y,y_0) \\
&\quad - 2\,\mathbb E_{XY}\!\left[\rho_X(X,x_0)\,\mathbb E_{Y'}\rho_Y(Y,Y')\right] - 2\,\mathbb E_{XY}\!\left[\rho_Y(Y,y_0)\,\mathbb E_{X'}\rho_X(X,X')\right],
\end{aligned}$$

$$\begin{aligned}
\theta_2 &= \mathbb E_X\mathbb E_{X'}\rho_X(X,X')\;\mathbb E_Y\mathbb E_{Y'}\rho_Y(Y,Y') + 4\,\mathbb E_X\rho_X(X,x_0)\,\mathbb E_Y\rho_Y(Y,y_0) \\
&\quad - 2\,\mathbb E_X\rho_X(X,x_0)\,\mathbb E_Y\mathbb E_{Y'}\rho_Y(Y,Y') - 2\,\mathbb E_Y\rho_Y(Y,y_0)\,\mathbb E_X\mathbb E_{X'}\rho_X(X,X'),
\end{aligned}$$

and

$$\begin{aligned}
\theta_3 &= \mathbb E_{X'Y'}\!\left[\mathbb E_X\rho_X(X,X')\,\mathbb E_Y\rho_Y(Y,Y')\right] + 3\,\mathbb E_X\rho_X(X,x_0)\,\mathbb E_Y\rho_Y(Y,y_0) + \mathbb E_{XY}\,\rho_X(X,x_0)\rho_Y(Y,y_0) \\
&\quad - \mathbb E_{XY}\!\left[\rho_X(X,x_0)\,\mathbb E_{Y'}\rho_Y(Y,Y')\right] - \mathbb E_{XY}\!\left[\rho_Y(Y,y_0)\,\mathbb E_{X'}\rho_X(X,X')\right] \\
&\quad - \mathbb E_X\rho_X(X,x_0)\,\mathbb E_Y\mathbb E_{Y'}\rho_Y(Y,Y') - \mathbb E_Y\rho_Y(Y,y_0)\,\mathbb E_X\mathbb E_{X'}\rho_X(X,X').
\end{aligned}$$

The claim now follows by inserting the resulting expansions into θ = θ_1 + θ_2 − 2θ_3 and cancelling the appropriate terms. Note that only the leading terms in the expansions remain, recovering the three terms of V²_{ρ_X,ρ_Y}(X,Y) in (7).

Remark 15. It turns out that k is not characteristic to M_+^1(X × Y), i.e., it cannot distinguish between any two distributions on X × Y, even if k_X and k_Y are characteristic. However, since γ_k is equal to the Brownian distance covariance, we know that it can always distinguish between any P_XY and its product of marginals P_X P_Y in the Euclidean case. Namely, note that k((x_0,y),(x_0,y′)) = k((x,y_0),(x′,y_0)) = 0 for all x, x′ ∈ X, y, y′ ∈ Y. That means that for every two distinct P_Y, Q_Y ∈ M_+^1(Y), one has γ_k²(δ_{x_0} P_Y, δ_{x_0} Q_Y) = 0. Thus, the kernel in (9) characterizes independence but not equality of probability measures on the product space. Informally speaking, independence testing is an easier problem than homogeneity testing on the product space.

A.2. Spectral Tests

Assume that the null hypothesis holds, i.e., that P = Q. For a kernel k and a Borel probability measure P, define a kernel "centred" at P:

$$\tilde k_P(z,z') := k(z,z') + \mathbb E_{WW'}\, k(W,W') - \mathbb E_W\, k(z,W) - \mathbb E_W\, k(z',W),$$

with W, W′ i.i.d.∼ P. Note that as a special case for P = δ_{z_0} we recover the family of kernels in (4), and that E_{ZZ′} k̃_P(Z,Z′) = 0, i.e., µ_{k̃_P}(P) = 0. The centred kernel is important in characterizing the null distribution of the V-statistic. To the centred kernel k̃_P on the domain Z, one associates the integral kernel operator S_{k̃_P} : L²_P(Z) → L²_P(Z) (see Steinwart & Christmann, 2008, p. 126–127), given by:

$$S_{\tilde k_P}\, g(z) = \int_{\mathcal Z} \tilde k_P(z,w)\, g(w)\, dP(w). \tag{13}$$

The following theorem is a special case of Gretton et al. (2012, Theorem 12). For simplicity, we focus on the case where m = n.

Theorem 16. Let Z = {Z_i}_{i=1}^m and W = {W_i}_{i=1}^m be two i.i.d. samples from P ∈ M_+^1(Z), and let S_{k̃_P} be a trace class operator. Then

$$\frac{m}{2}\,\hat\gamma^2_{k,V}(\mathbf Z, \mathbf W) \rightsquigarrow \sum_{i=1}^{\infty} \lambda_i N_i^2, \tag{14}$$

where N_i i.i.d.∼ N(0,1), i ∈ N, and {λ_i}_{i=1}^∞ are the eigenvalues of the operator S_{k̃_P}.

Note that this result requires that the integral kernel operator associated to the underlying probability measure P is a trace class operator, i.e., that E_{Z∼P} k(Z,Z) < ∞. As before, the sufficient condition for this to hold for all probability measures is that k is a bounded function. In the case of a distance kernel, this is the case if the domain Z has a bounded diameter with respect to the semimetric ρ, i.e., if sup_{z,z′∈Z} ρ(z,z′) < ∞.

The null distribution of HSIC takes a form analogous to (14), a weighted sum of chi-squares, but with coefficients corresponding to the products of the eigenvalues of the integral operators S_{k̃_{P_X}} and S_{k̃_{P_Y}}. The following theorem is given in Zhang et al. (2011, Theorem 4), and provides the asymptotic form of the null distribution of HSIC. See also Lyons (2011, Remark 2.9).

Theorem 17. Let Z = {(X_i,Y_i)}_{i=1}^m be an i.i.d. sample from P_XY = P_X P_Y, with values in X × Y. Let S_{k̃_{P_X}} : L²_{P_X}(X) → L²_{P_X}(X) and S_{k̃_{P_Y}} : L²_{P_Y}(Y) → L²_{P_Y}(Y) be trace class operators. Then

$$m\,\mathrm{HSIC}(\mathbf Z; k_X, k_Y) \rightsquigarrow \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} \lambda_i \eta_j N_{i,j}^2, \tag{15}$$

where N_{i,j} ∼ N(0,1), i, j ∈ N, are independent, and {λ_i}_{i=1}^∞ and {η_j}_{j=1}^∞ are the eigenvalues of the operators S_{k̃_{P_X}} and S_{k̃_{P_Y}}, respectively.

Note that if X and Y have bounded diameters w.r.t. ρ_X and ρ_Y, Theorem 17 applies to distance kernels induced by ρ_X and ρ_Y for all P_X ∈ M_+^1(X), P_Y ∈ M_+^1(Y).

A.3. A Characteristic Function Based Interpretation

The distance covariance in (7) was defined by Székely et al. (2007) in terms of a weighted distance between characteristic functions. We briefly review this interpretation here; however, we show that this approach cannot be used to derive a kernel-based measure of dependence (this result was first noted by Gretton et al. (2009b), and is included here in the interests of completeness). Let X be a random vector on X = R^p and Y a random vector on Y = R^q. The characteristic functions of X and Y, respectively, will be denoted by f_X and f_Y, and their joint characteristic function by f_XY. The distance covariance V(X,Y) is defined via the norm of f_XY − f_X f_Y in a weighted L₂ space on R^{p+q}, i.e.,

$$\mathcal V^2(X,Y) = \int \left| f_{X,Y}(t,s) - f_X(t) f_Y(s) \right|^2 w(t,s)\, dt\, ds, \tag{16}$$

for a particular choice of weight function given by

$$w(t,s) = \frac{1}{c_p c_q} \cdot \frac{1}{\|t\|^{1+p}\, \|s\|^{1+q}}, \tag{17}$$

where c_d = π^{(1+d)/2} / Γ((1+d)/2), d ≥ 1. An important aspect of the distance covariance is that V(X,Y) = 0 if and only if X and Y are independent. We next obtain a similar statistic in the kernel setting. Write Z = X × Y, and let k(z,z′) = κ(z − z′) be a translation invariant RKHS kernel on Z, where κ : Z → R is a bounded continuous function. By Bochner's theorem, κ can be written as

$$\kappa(z) = \int e^{-i z^\top u}\, d\Lambda(u),$$

for a finite non-negative Borel measure Λ. It follows from Gretton et al. (2009b) that

$$\gamma_k^2(P_{XY}, P_X P_Y) = \int \left| f_{X,Y}(t,s) - f_X(t) f_Y(s) \right|^2 d\Lambda(t,s),$$

which is in clear correspondence with (16). However, the weight function in (17) is not integrable, so one cannot find a translation invariant kernel for which γ_k coincides with the distance covariance. By contrast, note that the kernel in (9) is not translation invariant.

A.4. Restriction on Probability Measures

In general, distance kernels and their products are continuous but unbounded, so kernel embeddings are not defined for all Borel probability measures. Thus, one needs to restrict the attention to a particular class of Borel probability measures for which kernel embeddings exist; a sufficient condition for this is that E_{Z∼P} k^{1/2}(Z,Z) < ∞, by the Riesz representation theorem. Let k be a measurable reproducing kernel on Z, and denote, for θ > 0,

$$\mathcal M_k^\theta(\mathcal Z) = \left\{ \nu \in \mathcal M(\mathcal Z) : \int k^\theta(z,z)\, d|\nu|(z) < \infty \right\}. \tag{18}$$

Note that the maximum mean discrepancy γ_k(P,Q) is well defined ∀P,Q ∈ M_k^{1/2}(Z) ∩ M_+^1(Z).

Now, let ρ be a semimetric of negative type. Then, we can consider the class of probability measures that have a finite θ-moment with respect to ρ:

$$\mathcal M_\rho^\theta(\mathcal Z) = \left\{ \nu \in \mathcal M(\mathcal Z) : \exists z_0 \in \mathcal Z \text{ s.t. } \int \rho^\theta(z,z_0)\, d|\nu|(z) < \infty \right\}. \tag{19}$$

To ensure existence of the energy distance D_{E,ρ}(P,Q), we need to assume that P,Q ∈ M_ρ^1(Z), as otherwise the expectations E_{ZZ′}ρ(Z,Z′), E_{WW′}ρ(W,W′) and E_{ZW}ρ(Z,W) may be undefined. The following proposition shows that the classes of probability measures in (18) and (19) coincide at θ = n/2, for n ∈ N, whenever ρ is generated by the kernel k.

Proposition 18. Let k be a kernel that generates the semimetric ρ, and let n ∈ N. Then M_ρ^{n/2}(Z) = M_k^{n/2}(Z). In particular, if k_1 and k_2 generate the same semimetric ρ, then M_{k_1}^{n/2}(Z) = M_{k_2}^{n/2}(Z).

Proof. Let θ ≥ 1/2. Note that a^{2θ} is a convex function of a. Suppose ν ∈ M_k^θ(Z). Then, we have

$$\begin{aligned}
\int \rho^\theta(z,z_0)\, d|\nu|(z) &= \int \left\| k(\cdot,z) - k(\cdot,z_0) \right\|_{\mathcal H_k}^{2\theta} d|\nu|(z) \\
&\le \int \left( \|k(\cdot,z)\|_{\mathcal H_k} + \|k(\cdot,z_0)\|_{\mathcal H_k} \right)^{2\theta} d|\nu|(z) \\
&\le 2^{2\theta - 1} \left[ \int \|k(\cdot,z)\|_{\mathcal H_k}^{2\theta}\, d|\nu|(z) + \int \|k(\cdot,z_0)\|_{\mathcal H_k}^{2\theta}\, d|\nu|(z) \right] \\
&= 2^{2\theta - 1} \left[ \int k^\theta(z,z)\, d|\nu|(z) + k^\theta(z_0,z_0)\, |\nu|(\mathcal Z) \right] < \infty,
\end{aligned}$$

where we have invoked Jensen's inequality for convex functions. From the above it is clear that M_k^θ(Z) ⊂ M_ρ^θ(Z), for θ ≥ 1/2.

To prove the other direction, we show by induction that M_ρ^θ(Z) ⊂ M_k^{n/2}(Z) for θ ≥ n/2, n ∈ N. Let n = 1, let θ ≥ 1/2, and suppose that ν ∈ M_ρ^θ(Z). Then, by invoking the reverse triangle and Jensen's inequalities, we have:

$$\begin{aligned}
\int \rho^\theta(z,z_0)\, d|\nu|(z) &= \int \left\| k(\cdot,z) - k(\cdot,z_0) \right\|_{\mathcal H_k}^{2\theta} d|\nu|(z) \\
&\ge \int \left| k^{1/2}(z,z) - k^{1/2}(z_0,z_0) \right|^{2\theta} d|\nu|(z) \\
&\ge \left( \int k^{1/2}(z,z)\, d|\nu|(z) - \|\nu\|_{TV}\, k^{1/2}(z_0,z_0) \right)^{2\theta},
\end{aligned}$$

which implies ν ∈ M_k^{1/2}(Z), thereby establishing the result for n = 1. Suppose the result holds for θ ≥ (n−1)/2, i.e., M_ρ^θ(Z) ⊂ M_k^{(n−1)/2}(Z) for θ ≥ (n−1)/2. Let ν ∈ M_ρ^θ(Z) for θ ≥ n/2. Then we have

$$\begin{aligned}
\int \rho^\theta(z,z_0)\, d|\nu|(z) &= \int \left( \left\| k(\cdot,z) - k(\cdot,z_0) \right\|_{\mathcal H_k}^{n} \right)^{\frac{2\theta}{n}} d|\nu|(z) \\
&\ge \left( \int \left\| k(\cdot,z) - k(\cdot,z_0) \right\|_{\mathcal H_k}^{n} d|\nu|(z) \right)^{\frac{2\theta}{n}} \\
&\ge \left( \int \left| \|k(\cdot,z)\|_{\mathcal H_k} - \|k(\cdot,z_0)\|_{\mathcal H_k} \right|^{n} d|\nu|(z) \right)^{\frac{2\theta}{n}} \\
&\ge \left( \int \left( \|k(\cdot,z)\|_{\mathcal H_k} - \|k(\cdot,z_0)\|_{\mathcal H_k} \right)^{n} d|\nu|(z) \right)^{\frac{2\theta}{n}} \\
&= \left( \sum_{r=0}^{n} (-1)^r \binom{n}{r} \int \|k(\cdot,z)\|_{\mathcal H_k}^{n-r}\, \|k(\cdot,z_0)\|_{\mathcal H_k}^{r}\, d|\nu|(z) \right)^{\frac{2\theta}{n}} \\
&= \Bigg( \underbrace{\int k^{\frac{n}{2}}(z,z)\, d|\nu|(z)}_{A} + \underbrace{\sum_{r=1}^{n} (-1)^r \binom{n}{r} k^{\frac{r}{2}}(z_0,z_0) \int k^{\frac{n-r}{2}}(z,z)\, d|\nu|(z)}_{B} \Bigg)^{\frac{2\theta}{n}}.
\end{aligned}$$

Note that the terms in B are finite since, for θ ≥ n/2 ≥ (n−1)/2 ≥ ⋯ ≥ 1/2, we have M_ρ^θ(Z) ⊂ M_k^{(n−1)/2}(Z) ⊂ ⋯ ⊂ M_k^1(Z) ⊂ M_k^{1/2}(Z); therefore A is finite, which means ν ∈ M_k^{n/2}(Z), i.e., M_ρ^θ(Z) ⊂ M_k^{n/2}(Z) for θ ≥ n/2. The result shows that M_ρ^θ(Z) = M_k^θ(Z) for all θ ∈ {n/2 : n ∈ N}.

The above proposition gives a natural interpretation of conditions on probability measures in terms of moments w.r.t. ρ. Namely, the kernel embedding µ_k(P), where the kernel k generates the semimetric ρ, exists for every P with finite half-moment w.r.t. ρ, and thus the MMD between P and Q, γ_k(P,Q), is well defined whenever both P and Q have finite half-moments w.r.t. ρ. If, in addition, P and Q have finite first moments w.r.t. ρ, then the ρ-energy distance between P and Q is also well defined, and it must be equal to the MMD, by Theorem 11.

Rather than imposing the condition on Borel probability measures, one may assume that the underlying semimetric space (Z,ρ) of negative type is itself bounded, i.e., that sup_{z,z′∈Z} ρ(z,z′) < ∞, implying that distance kernels are bounded functions, and that both the MMD and the energy distance are always defined. Conversely, bounded kernels (such as the Gaussian) always induce bounded semimetrics.


Table 1. MMD with distance kernels on data from Gretton et al. (2009a). Dimensionality is: Neural I (64), Neural II (100), Health status (12,600), Subtype (2,118). Boldface in the original marks instances where the distance kernel had smaller Type II error than the Gaussian kernel.

Dataset               Measure       Gauss  dist(1/3)  dist(2/3)  dist(1)  dist(4/3)  dist(5/3)  dist(2)
Neural I   (m = 200)  1 - Type I    .956   .969       .964       .949     .952       .959       .959
                      Type II       .118   .170       .139       .119     .109       .089       .117
Neural I   (m = 250)  1 - Type I    .950   .969       .946       .962     .947       .930       .953
                      Type II       .063   .075       .045       .041     .040       .065       .052
Neural II  (m = 200)  1 - Type I    .956   .968       .965       .963     .956       .958       .943
                      Type II       .292   .485       .346       .319     .297       .280       .290
Neural II  (m = 250)  1 - Type I    .963   .980       .968       .950     .952       .960       .941
                      Type II       .195   .323       .197       .189     .194       .169       .183
Subtype    (m = 10)   1 - Type I    .975   .974       .977       .971     .966       .962       .966
                      Type II       .055   .828       .237       .092     .042       .033       .024
Health st. (m = 20)   1 - Type I    .958   .980       .953       .940     .954       .954       .955
                      Type II       .036   .037       .039       .081     .114       .120       .165

A.5. Distance Correlation

The notion of distance covariance extends naturally to that of the distance variance V²(X) = V²(X,X) and that of the distance correlation (in analogy to the Pearson product-moment correlation coefficient):

$$\mathcal R^2(X,Y) = \begin{cases} \dfrac{\mathcal V^2(X,Y)}{\mathcal V(X)\,\mathcal V(Y)}, & \mathcal V(X)\mathcal V(Y) > 0, \\[1ex] 0, & \mathcal V(X)\mathcal V(Y) = 0. \end{cases}$$

Distance correlation also has a straightforward interpretation in terms of kernels:

$$\mathcal R^2(X,Y) = \frac{\gamma_k^2(P_{XY}, P_X P_Y)}{\gamma_k(P_{XX}, P_X P_X)\;\gamma_k(P_{YY}, P_Y P_Y)} = \frac{\left\|\Sigma_{XY}\right\|_{HS}^2}{\left\|\Sigma_{XX}\right\|_{HS}\,\left\|\Sigma_{YY}\right\|_{HS}},$$

where the covariance operator Σ_XY : H_{k_X} → H_{k_Y} is the linear operator for which ⟨Σ_XY f, g⟩_{H_{k_Y}} = E_XY[f(X)g(Y)] − E_X f(X) E_Y g(Y) for all f ∈ H_{k_X} and g ∈ H_{k_Y}, and ‖·‖_HS denotes the Hilbert-Schmidt norm (Gretton et al., 2005b). It is clear that R is invariant to scaling (X,Y) ↦ (εX, εY), ε > 0, whenever the corresponding semimetrics are homogeneous, i.e., whenever ρ_X(εx, εx′) = ερ_X(x,x′), and similarly for ρ_Y. Moreover, R is invariant to translations (X,Y) ↦ (X + x′, Y + y′), x′ ∈ X, y′ ∈ Y, whenever ρ_X and ρ_Y are translation invariant.

A.6. Further Experiments

We assessed the performance of two-sample tests based on distance kernels with various exponents, and compared it to that of a Gaussian kernel, on real-world multivariate datasets: Health status (microarray data from normal and tumor tissues), Subtype (microarray data from different subtypes of cancer) and Neural I/II (local field potential (LFP) electrode recordings from the Macaque primary visual cortex (V1), with and without spike events), all discussed in Gretton et al. (2009a). In contrast to Gretton et al. (2009a), we used smaller sample sizes, so that some Type II error persists; at higher sample sizes, all tests exhibit Type II error which is virtually zero. The results are reported in Table 1. We used the spectral test for all experiments, and the reported averages are obtained by running 1000 trials. We note that for the dataset Subtype, which is high dimensional but with only a small number of dimensions varying in mean, a larger exponent results in a test of greater power.
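To close, here is a sketch (ours, not from the paper; NumPy/SciPy assumed) of the kernel form of R² from Section A.5, using q = 1 distance kernels and the biased HSIC estimates of (12):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def hsic_v(Kx, Ky):
    """Biased HSIC V-statistic (12)."""
    m = Kx.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(Kx @ H @ Ky @ H) / m**2

def distance_correlation_sq(X, Y):
    """Empirical R^2 = HSIC(X,Y) / sqrt(HSIC(X,X) * HSIC(Y,Y)) with q = 1 distance kernels."""
    k = lambda A: 0.5 * (np.linalg.norm(A, axis=1)[:, None]
                         + np.linalg.norm(A, axis=1)[None, :]
                         - squareform(pdist(A)))
    Kx, Ky = k(X), k(Y)
    denom = np.sqrt(hsic_v(Kx, Kx) * hsic_v(Ky, Ky))
    return hsic_v(Kx, Ky) / denom if denom > 0 else 0.0

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 1))
Y = X + 0.5 * rng.normal(size=(200, 1))
print(distance_correlation_sq(X, Y))   # well above 0 under dependence; near 0 if independent
```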