
A Kernel Test for Three-Variable Interactions with Random Processes

Paul K. Rubenstein¹²³, Kacper P. Chwialkowski³, Arthur Gretton⁴

¹Machine Learning Group, University of Cambridge
²Empirical Inference, MPI for Intelligent Systems, Tübingen, Germany
³Department of Computer Science, University College London
⁴Gatsby Computational Neuroscience Unit, University College London

[email protected], [email protected], [email protected]

Abstract

We apply a wild bootstrap method to the Lancaster three-variable interaction measure in order to detect factorisation of the joint distribution on three variables forming a stationary random process, for which the existing permutation bootstrap method fails. As in the i.i.d. case, the Lancaster test is found to outperform existing tests in cases for which two independent variables individually have a weak influence on a third, but when considered jointly the influence is strong. The main contributions of this paper are twofold: first, we prove that the Lancaster statistic satisfies the conditions required to estimate the quantiles of the null distribution using the wild bootstrap; second, the manner in which this is proved is novel, simpler than existing methods, and can further be applied to other statistics.

1 INTRODUCTION

Nonparametric testing of independence or interaction between random variables is a core staple of machine learning and statistics. The majority of nonparametric statistical tests of independence for continuous-valued random variables rely on the assumption that the observed data are drawn i.i.d. [Feuerverger, 1993, Gretton et al., 2007, Székely et al., 2007, Gretton and Gyorfi, 2010, Heller et al., 2013]. The same assumption applies to tests of conditional dependence, and of multivariate interaction between variables [Zhang et al., 2011, Kankainen and Ushakov, 1998, Fukumizu et al., 2008, Sejdinovic et al., 2013, Patra et al., 2015]. For many applications in finance, medicine, and audio signal analysis, however, the i.i.d. assumption is unrealistic and overly restrictive.

While many approaches exist for testing interactions between time series under strong parametric assumptions [Kirchgässner et al., 2012, Ledford and Tawn, 1996], the problem of testing for general, nonlinear interactions has seen far less analysis: tests of pairwise dependence have been proposed by Gaisser et al. [2010], Besserve et al. [2013], Chwialkowski et al. [2014], Chwialkowski and Gretton [2014], where the first publication also addresses mutual independence of more than two univariate time series. The two final works use as their statistic the Hilbert-Schmidt Independence Criterion, a general nonparametric measure of dependence [Gretton et al., 2005], which applies even for multivariate or non-Euclidean variables (such as strings and groups). The asymptotic behaviour and corresponding test threshold are derived using particular assumptions on the mixing properties of the processes from which the observations are drawn. These kernel approaches apply only to pairs of random processes, however.

The Lancaster interaction is a signed measure that can be used to construct a test statistic capable of detecting dependence between three random variables [Lancaster, 1969, Sejdinovic et al., 2013]. If the joint distribution on the three variables factorises in some way into a product of a marginal and a pairwise marginal, the Lancaster interaction is zero everywhere. Given observations, this can be used to construct a statistical test, the null hypothesis of which is that the joint distribution factorises thus. In the i.i.d. case, the null distribution of the test statistic can be estimated using a permutation bootstrap technique: this amounts to shuffling the indices of one or more of the variables and recalculating the test statistic on this bootstrapped data set. When our samples instead exhibit temporal dependence, shuffling the time indices destroys this dependence, and thus doing so does not correspond to a valid resample of the test statistic.

Provided that our data-generating process satisfies some technical conditions on the forms of temporal dependence, recent work by Leucht and Neumann [2013], building on the work of Shao [2010], can come to our rescue. The wild bootstrap is a method that correctly resamples from the null distribution of a test statistic, subject to certain conditions on both the test statistic and the processes from which the observations have been drawn.

In this paper we show that the Lancaster interaction test statistic satisfies the conditions required to apply the wild bootstrap procedure; moreover, the manner in which we prove this is significantly simpler than existing proofs in the literature of the same property for other kernel test statistics [Chwialkowski et al., 2014, Chwialkowski and Gretton, 2014]. Previous proofs have relied on the classical theory of V-statistics to analyse the asymptotic distribution of the kernel statistic. In particular, the Hoeffding decomposition gives an expression for the kernel test statistic as a sum of other V-statistics. Understanding the asymptotic properties of the components of this decomposition is then conceptually tractable, but algebraically extremely painful. Moreover, as the complexity of the test statistic under analysis grows, the number of terms that must be considered in this approach grows factorially.¹ We conjecture that such analysis of interaction statistics of 4 or more variables would in practice be unfeasible without automatic theorem provers due to the sheer number of terms in the resulting computations.

¹See for example Lemma 8 in Supplementary material A.3 of Chwialkowski and Gretton [2014]. The proof of this lemma requires keeping track of 4! terms; an equivalent approach for the Lancaster test would have 6! terms. Depending on the precise structure of the statistic, this approach applied to a test involving 4 variables could require as many as 8! = 40320 terms.

In contrast, in the approach taken in this paper we explicitly consider our test statistic to be the norm of a Hilbert space operator. We exploit a Central Limit Theorem for Hilbert space valued random variables [Dehling et al., 2015] to show that our test statistic converges in probability to the norm of a related population-centred Hilbert space operator, for which the asymptotic analysis is much simpler. Our approach is novel; previous analyses have not, to our knowledge, leveraged the Hilbert space geometry in the context of statistical hypothesis testing using kernel V-statistics in this way.

We propose that our method may in future be applied to the asymptotic analysis of other kernel statistics. In the appendix, we provide an application of this method to the Hilbert-Schmidt Independence Criterion (HSIC) test statistic, giving a significantly shorter and simpler proof than that given in Chwialkowski and Gretton [2014].

The Central Limit Theorem that we use in this paper makes certain assumptions on the mixing properties of the random processes from which our data are drawn; as further progress is made, this may be substituted for more up-to-date theorems that make weaker mixing assumptions.

OUTLINE: In Section 2, we detail the Lancaster interaction test and provide our main results. These results justify use of the wild bootstrap to understand the null distribution of the test statistic. In Section 3, we provide more detail about the wild bootstrap, prove that its use correctly controls Type I error, and give a consistency result. In Section 4, we evaluate the Lancaster test on synthetic data to identify cases in which it outperforms existing methods, as well as cases in which it is outperformed. In Section 6, we provide proofs of the main results of this paper, in particular the aforementioned novel proof. Further proofs may be found in the Supplementary material.

2 LANCASTER INTERACTION TEST

2.1 KERNEL NOTATION

Throughout this paper we will assume that the kernels k, l, m, defined on the domains X, Y and Z respectively, are characteristic [Sriperumbudur et al., 2011], bounded and Lipschitz continuous. We describe some notation relevant to the kernel k; similar notation holds for l and m. Recall that µ_X := E_X k(X, ·) ∈ F_k is the mean embedding [Smola et al., 2007] of the random variable X. Given observations X_i, an estimate of the mean embedding is µ̃_X = (1/n) Σ_{i=1}^n k(X_i, ·). Two modifications of k are used in this work:

    k̄(x, x') = ⟨k(x, ·) − µ_X, k(x', ·) − µ_X⟩,   (1)
    k̃(x, x') = ⟨k(x, ·) − µ̃_X, k(x', ·) − µ̃_X⟩.   (2)

These are called the population centered kernel and empirically centered kernel respectively.

2.2 LANCASTER INTERACTION

The Lancaster interaction on the triple of random variables (X, Y, Z) is defined as the signed measure

    ∆_L P = P_XYZ − P_XY P_Z − P_XZ P_Y − P_X P_YZ + 2 P_X P_Y P_Z.

This measure can be used to detect three-variable interactions. It is straightforward to show that if any variable is independent of the other two (equivalently, if the joint distribution P_XYZ factorises into a product of marginals in any way), then ∆_L P = 0. That is, writing H_X = {X ⊥⊥ (Y, Z)} and similar for H_Y and H_Z, we have that

    H_X ∨ H_Y ∨ H_Z ⇒ ∆_L P = 0   (3)

The reverse implication does not hold, and thus no conclusion about the veracity of the H_· can be drawn when ∆_L P = 0. Following Sejdinovic et al. [2013], we can consider the mean embedding of this measure:

    µ_L = ∫ k(x, ·) l(y, ·) m(z, ·) d∆_L P(x, y, z)   (4)

Given an i.i.d. sample (X_i, Y_i, Z_i)_{i=1}^n, the norm of the mean embedding µ_L can be empirically estimated using empirically centered kernel matrices. For example, for the kernel k with kernel matrix K_ij = k(X_i, X_j), the empirically centered kernel matrix K̃ is given by

    K̃_ij = ⟨k(X_i, ·) − µ̃_X, k(X_j, ·) − µ̃_X⟩.

By Sejdinovic et al. [2013], an estimator of the norm of the mean embedding of the Lancaster interaction for i.i.d. samples is

    ‖µ̂_L‖² = (1/n²) (K̃ ∘ L̃ ∘ M̃)_{++}   (5)

where ∘ is the Hadamard (element-wise) product and e.g. A_{++} = Σ_{ij} A_ij for a matrix A.

Our first contribution is to show that the (difficult) study of the asymptotic null distribution of n‖µ̂_L‖² can be reduced to studying population centered kernels:

    ‖µ̂_{L,2}^{(Z)}‖² = (1/n²) (K̄ ∘ L̄ ∘ M̄)_{++}

where K̄_ij = ⟨k(X_i, ·) − µ_X, k(X_j, ·) − µ_X⟩.

2.3 TESTING PROCEDURE

In this paper, we construct a statistical test for three-variable interaction, using n‖µ̂_L‖² as the test statistic to distinguish between the following hypotheses:

    H_0: H_X ∨ H_Y ∨ H_Z
    H_1: P_XYZ does not factorise in any way

The null hypothesis H_0 is a composite of the three 'sub-hypotheses' H_X, H_Y and H_Z. We test H_0 by testing each of the sub-hypotheses separately, and we reject if and only if we reject each of H_X, H_Y and H_Z. Hereafter we describe the procedure for testing H_Z; similar results hold for H_X and H_Y.

2.4 MAIN RESULTS

Specifically, we prove the following:

Theorem 1. Suppose that (X_i, Y_i, Z_i)_{i=1}^n are drawn from a random process satisfying suitable mixing assumptions. Under H_Z, lim_{n→∞} (n‖µ̂_{L,2}^{(Z)}‖² − n‖µ̂_L‖²) = 0 in probability.

Our proof of Theorem 1 relies crucially on the following Lemma, which we prove in Supplementary material A.1:

Lemma 1. Suppose that (X_i)_{i=1}^n is drawn from a random process satisfying suitable mixing assumptions and that k is a bounded kernel on X. Then ‖µ̂_X − µ_X‖_k = O_P(n^{−1/2}).

Proof. (Theorem 1) We provide a short sketch of the proof here; for a full proof, see Section 6. The key idea is to note that we can rewrite n‖µ̂_L‖² in terms of the population centred kernel matrices K̄, L̄ and M̄. Each of the resulting terms can in turn be converted to an inner product between quantities of the form µ̂ − µ, where µ̂ is an empirical estimator of µ, and each µ is a mean embedding or covariance operator. By applying Lemma 1 to the µ̂ − µ, we show that most of these terms converge in probability to 0, with the residual terms equaling n‖µ̂_{L,2}^{(Z)}‖².

As discussed in Section 1, the essential idea of this proof is novel and the resulting proof is significantly more concise than previous approaches [Chwialkowski and Gretton, 2014, Chwialkowski et al., 2014].

Sejdinovic et al. [2013] show that, under H_Z, n‖µ̂_L‖² converges to an infinite sum of weighted χ-squared random variables. By leveraging the i.i.d. assumption on the samples, any given quantile of this distribution can be estimated using a simple permutation bootstrap, and so a test procedure is proposed.

In the time series setting this approach does not work. Temporal dependence within the samples makes study of the asymptotic distribution of n‖µ̂_L‖² difficult; in Section 4.2 we verify experimentally that the permutation bootstrap used in the i.i.d. case fails. To construct a test in this setting we will use asymptotic and bootstrap results for mixing processes.

Mixing formalises the notion of the temporal structure within a process, and can be thought of as the rate at which the process forgets about its past. For example, for Gaussian processes this rate can be captured by the autocorrelation function; for general processes, generalisations of autocorrelation are used. The exact assumptions we make about the mixing properties of processes in this paper are discussed in Supplementary material A.7. For brevity in statements of results throughout this paper, however, we define sufficient suitable mixing assumptions in Section 3.

Theorem 1 is useful because the statistic ‖µ̂_{L,2}^{(Z)}‖² is much easier to study under the non-i.i.d. assumption than ‖µ̂_L‖². Indeed, it can be expressed as a V-statistic (see Section 3.2)

    V_n = (1/n²) Σ_{1≤i,j≤n} (k̄ ⊗ l̄ ⊗ m̄)(S_i, S_j)

where S_i = (X_i, Y_i, Z_i). The crucial observation is that

    h := k̄ ⊗ l̄ ⊗ m̄

is well behaved in the following sense.

Theorem 2. Suppose that k, l and m are bounded, symmetric, Lipschitz continuous kernels. Then h is also bounded, symmetric and Lipschitz continuous, and is moreover degenerate under H_Z, i.e. E_S h(S, s) = 0 for any fixed s.

Proof. See Section 6.

The asymptotic analysis of such a V-statistic for non-i.i.d. data is still complex, but we can appeal to prior work: Leucht and Neumann [2013] showed a way to estimate any given quantile of such a V-statistic under the null hypothesis using a method called the wild bootstrap (introduced in Section 3). This, combined with analysis of the V-statistic under the alternative hypothesis provided in Theorem 2 of Chwialkowski et al. [2014],² results in a statistical test given in Algorithm 1. The wild bootstrap is used in line (∗) to generate samples of the null distribution.

²Note that similar results are presented in Leucht and Neumann [2013] as specific cases.

It is straightforward to show that the norm of the mean embedding (5) can also be written as

    ‖µ̂_L‖² = (1/n²) ((K̃ ∘ L̃)˜ ∘ M̃)_{++}

where (A)˜ := HAH, with H = I − (1/n)11ᵀ, denotes the empirical centring of a matrix A.

Algorithm 1 Test H_Z with Wild Bootstrap

    Input: K̃, L̃, M̃ (each of size n × n); N = number of bootstraps; α = p-value threshold
    n‖µ̂_L‖² = n (1/n²) ((K̃ ∘ L̃)˜ ∘ M̃)_{++}
    samples = zeros(1, N)
    for i = 1 to N do
        Draw random vector W according to Equation (6)
        samples[i] = n (1/n²) Wᵀ ((K̃ ∘ L̃)˜ ∘ M̃) W    (∗)
    end for
    if sum(n‖µ̂_L‖² > samples)/N > 1 − α then
        Reject H_Z
    else
        Do not reject H_Z
    end if

In Section 3 we discuss the wild bootstrap and provide results regarding consistency and Type I error control.

2.5 MULTIPLE TESTING CORRECTION

In the Lancaster test, we reject the composite null hypothesis H_0 if and only if we reject all three of the components. In Sejdinovic et al. [2013], it is suggested that the Holm-Bonferroni correction be used to account for multiple testing [Holm, 1979]. We show here that more relaxed conditions on the p-values can be used while still bounding the Type I error, thus increasing test power.

Denote by A_· the event that H_· is rejected. Then

    P(A_0) = P(A_X ∧ A_Y ∧ A_Z) ≤ min{P(A_X), P(A_Y), P(A_Z)}

If H_0 is true, then so must be one of its components. Without loss of generality assume that H_X is true. If we use significance levels of α in each test individually, then P(A_X) ≤ α and thus P(A_0) ≤ α. Therefore rejecting H_0 in the event that each test individually has p-value less than α guarantees a Type I error overall of at most α. In contrast, the Holm-Bonferroni method requires that the sorted p-values be lower than [α/3, α/2, α] in order to reject the null hypothesis overall. It is therefore more conservative than necessary and thus has worse test power compared to the 'simple correction' proposed here. This is experimentally verified in Section 4.

3 THE WILD BOOTSTRAP

In this section we discuss the wild bootstrap and provide consistency and Type I error results for the proposed Lancaster test.

3.1 TEMPORAL DEPENDENCE

There are various formalisations of memory or 'mixing' of a random process [Doukhan, 1994, Bradley et al., 2005, Dedecker et al., 2007]; of relevance to this paper is the following:

Definition 1. A process (X_t)_t is β-mixing (also known as absolutely regular) if β(m) → 0 as m → ∞, where

    β(m) = sup_n sup (1/2) Σ_{i=1}^I Σ_{j=1}^J |P(A_i ∩ B_j) − P(A_i)P(B_j)|

where the second supremum is taken over all finite partitions {A_1, ..., A_I} and {B_1, ..., B_J} of the sample space such that A_i ∈ F_1^n and B_j ∈ F_{n+m}^∞, and F_b^c = σ(X_b, X_{b+1}, ..., X_c).

SUITABLE MIXING ASSUMPTIONS

We assume that the random process S_i = (X_i, Y_i, Z_i) is β-mixing with mixing coefficients satisfying β(m) = o(m^{−6}). Throughout this paper we refer to this assumption as suitable mixing assumptions. For a more in-depth discussion of the mixing assumptions made in this paper, see Supplementary material A.7.

3.2 V-STATISTICS

A V-statistic of a 2-argument, symmetric function h given observations S_n = {S_1, ..., S_n} is [Serfling, 2009]:

    V_n = (1/n²) Σ_{1≤i,j≤n} h(S_i, S_j)

We call nV_n a normalised V-statistic. We call h the core of V, and we say that h is degenerate if, for any s_1, E_{S_2∼P}[h(s_1, S_2)] = 0, in which case we say that V is a degenerate V-statistic. Many kernel test statistics can be viewed as normalised V-statistics which, under the null hypothesis, are degenerate. As mentioned in the previous section, ‖µ̂_{L,2}^{(Z)}‖² is a V-statistic. Theorems 1 and 2 together imply that, under H_Z, it can be treated as a degenerate V-statistic.

3.3 WILD BOOTSTRAP

If the test statistic has the form of a normalised V-statistic, then provided certain extra conditions are met, the wild bootstrap of Leucht and Neumann [2013] is a method to directly resample the test statistic under the null hypothesis. These conditions can be categorised as concerning: (1) appropriate mixing of the process from which our observations are drawn; (2) the core of the V-statistic.

The condition on the core that is of crucial importance to this paper is that it must be degenerate. Theorem 2 justifies our use of the wild bootstrap in the Lancaster interaction test.

Given the statistic nV_n, Leucht and Neumann [2013] tells us that a random vector W of length n can be drawn such that the bootstrapped statistic³

    nV_b = (1/n) Σ_{i,j} W_i h(S_i, S_j) W_j

is distributed according to the null distribution of nV_n. By generating many such W and calculating nV_b for each, we can estimate the quantiles of nV_n.

³Note that for fixed S_n, nV_b is a random variable through the randomness introduced by W.

3.4 GENERATING W

The process generating W must satisfy conditions (B2) given on page 6 of Leucht and Neumann [2013] for nV_b to correctly resample from the null distribution of nV_n. For brevity, we provide here only an example of such a process; the interested reader should consult Leucht and Neumann [2013] or Appendix A of Chwialkowski et al. [2014] for a more detailed discussion of the bootstrapping process. The following bootstrapping process was used in the experiments in Section 4:

    W_t = e^{−1/l_n} W_{t−1} + √(1 − e^{−2/l_n}) ε_t   (6)

where W_1, ε_1, ..., ε_t are independent N(0, 1) random variables. l_n should be taken from a sequence {l_n} such that lim_{n→∞} l_n = ∞; in practice we used l_n = 20 for all of the experiments since the values of n were roughly comparable in each case.

3.5 CONTROL OF TYPE I ERROR

The following theorem shows that by estimating the quantiles of the wild bootstrapped statistic nV_b we correctly control the Type I error when testing H_Z.

Theorem 3. Suppose that (X_i, Y_i, Z_i)_{i=1}^n are drawn from a random process satisfying suitable mixing conditions, and that W is drawn from a process satisfying (B2) in Leucht and Neumann [2013]. Then asymptotically, the quantiles of

    nV_b = (1/n) Wᵀ (K̄ ∘ L̄ ∘ M̄) W

converge to those of n‖µ̂_L‖².

Proof. See Supplementary material A.3.

3.6 (SEMI-)CONSISTENCY OF TESTING PROCEDURE

Note that in order to achieve consistency for this test, we would need that H_0 ⟺ ∆_L P = 0. Unfortunately this does not hold: in Sejdinovic et al. [2013] examples are given of distributions for which H_0 is false, and yet ∆_L P = 0. However, the following result does hold:

Theorem 4. Suppose that ∆_L P ≠ 0. Then as n → ∞, the probability of correctly rejecting H_0 converges to 1.

Proof. See Supplementary material A.4.

At the time of writing, a characterisation of distributions for which H_0 is false yet ∆_L P = 0 is unknown. Therefore, if we reject H_0, then we conclude that the distribution does not factorise; if we fail to reject H_0, then we cannot conclude that the distribution factorises.

4 EXPERIMENTS

The Lancaster test described above amounts to a method to test each of the sub-hypotheses H_X, H_Y, H_Z. Rather than using the Lancaster test statistic with wild bootstrap to test each of these, we could instead use HSIC. For example, by considering the pair of variables (X, Y) and Z with kernels k ⊗ l and m respectively, HSIC can be used to test H_Z. Similar grouping of the variables can be used to test H_X and H_Y. Applying the same multiple testing correction as in the Lancaster test, we derive an alternative test of dependence between three variables. We refer to this HSIC-based procedure as 3-way HSIC.

In the case of i.i.d. observations, it was shown in Sejdinovic et al. [2013] that the Lancaster statistical test is more sensitive to dependence between three random variables than the above HSIC-based test when pairwise interaction is weak but joint interaction is strong. In this section, we demonstrate that the same is true in the time series case on synthetic data.
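As a concrete illustration of the testing procedure, the following sketch implements the wild bootstrap of Algorithm 1 in NumPy. This is a minimal sketch, not the authors' code: the Gaussian kernel with unit bandwidth and the helper names are our own choices (matching the experimental settings described later), W is drawn according to Equation (6) with l_n = 20, and empirical centring of a Gram matrix A is done via HAH with H = I − (1/n)11ᵀ.

```python
import numpy as np

def centered_gram(x, bandwidth=1.0):
    """Empirically centred Gaussian kernel matrix (K tilde)."""
    d2 = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-d2 / (2 * bandwidth ** 2))
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H  # equals <k(x_i,.) - mu~, k(x_j,.) - mu~>

def draw_wild_process(n, l_n=20.0, rng=None):
    """W_t = exp(-1/l_n) W_{t-1} + sqrt(1 - exp(-2/l_n)) eps_t, as in Equation (6)."""
    rng = np.random.default_rng() if rng is None else rng
    rho = np.exp(-1.0 / l_n)
    W = np.empty(n)
    W[0] = rng.standard_normal()
    eps = rng.standard_normal(n)
    for t in range(1, n):
        W[t] = rho * W[t - 1] + np.sqrt(1 - rho ** 2) * eps[t]
    return W

def lancaster_wild_bootstrap_test(Kt, Lt, Mt, n_boot=250, alpha=0.05, rng=None):
    """Test H_Z from empirically centred kernel matrices, per Algorithm 1."""
    n = Kt.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    A = (H @ (Kt * Lt) @ H) * Mt        # (K~ o L~)~ o M~, the matrix in line (*)
    stat = A.sum() / n                  # n * (1/n^2) * (...)_{++}
    samples = np.empty(n_boot)
    for b in range(n_boot):
        W = draw_wild_process(n, rng=rng)
        samples[b] = W @ A @ W / n      # line (*): bootstrapped statistic
    p_value = np.mean(samples >= stat)
    return stat, p_value, p_value < alpha
```

Testing H_X and H_Y amounts to permuting the roles of the three centred matrices, after which the simple correction of Section 2.5 combines the three p-values.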

4.1 WEAK PAIRWISE INTERACTION, STRONG JOINT INTERACTION

This experiment demonstrates that the Lancaster test has greater power than 3-way HSIC when the pairwise interaction is weak, but joint interaction is strong. Synthetic data were generated from autoregressive processes X, Y and Z according to:

    X_t = (1/2) X_{t−1} + ε_t
    Y_t = (1/2) Y_{t−1} + η_t
    Z_t = (1/2) Z_{t−1} + d |θ_t| sign(X_t Y_t) + ζ_t

where X_0, Y_0, Z_0, ε_t, η_t, θ_t and ζ_t are i.i.d. N(0, 1) random variables and d ∈ R, called the dependence coefficient, determines the extent to which the process (Z_t)_t is dependent on (X_t, Y_t)_t.

Data were generated with varying values of d. For each value of d, 300 datasets were generated, each consisting of 1200 consecutive observations of the variables. Gaussian kernels with bandwidth 1 were used on each variable, and 250 bootstrapping procedures were used for each test on each dataset.

Observe that the random variables are pairwise independent but jointly dependent. Both the Lancaster and 3-way HSIC tests should be able to detect the dependence and therefore reject the null hypothesis in the limit of infinite data. In the finite data regime, the value of d drastically affects how hard it is to detect the dependence. The results of this experiment are presented in Figure 1, which shows that the Lancaster test achieves very high test power with weak dependence coefficients compared to 3-way HSIC. Note also that when using the simple multiple testing correction a higher test power is achieved than with the Holm-Bonferroni correction.

[Figure 1: power of HSIC and Lancaster joint factorisation tests against the dependence coefficient.] Figure 1: Results of experiment in Section 4.1. (S) refers to the simple multiple testing correction; (HB) refers to Holm-Bonferroni. The Lancaster test is more sensitive to dependence than 3-way HSIC, and test power for both tests is higher when using the simple correction rather than the Holm-Bonferroni multiple testing correction.

4.2 FALSE POSITIVE RATES

This experiment demonstrates that in the time series case, existing permutation bootstrap methods fail to control the Type I error, while the wild bootstrap correctly identifies test statistic thresholds and appropriately controls Type I error. Synthetic data were generated from autoregressive processes X, Y and Z according to:

    X_t = a X_{t−1} + ε_t
    Y_t = a Y_{t−1} + η_t
    Z_t = a Z_{t−1} + ζ_t

where X_0, Y_0, Z_0, ε_t, η_t and ζ_t are i.i.d. N(0, 1) random variables and a, called the dependence coefficient, determines how temporally dependent the processes are. The null hypothesis in this example is true, as each process is independent of the others.

The Lancaster test was performed using both the Wild Bootstrap and the simple permutation bootstrap (used in the i.i.d. case) in order to sample from the null distributions of the test statistic. We used a fixed desired false positive rate α = 0.05 with samples of size 1000, with 200 experiments run for each value of a. Figure 2 shows the false positive rates for these two methods for varying a. It shows that as the processes become more dependent, the false positive rate for the permutation method becomes very large, and is not bounded by the fixed α, whereas the false positive rate for the Wild Bootstrap method is bounded by α.

[Figure 2: false positives, Wild Bootstrap vs Permutation, Type I error against the dependence coefficient, with the desired Type I error bound (0.05) marked.] Figure 2: Results of experiment in Section 4.2. Whereas the wild bootstrap succeeds in controlling the Type I error across all values of the dependence coefficient, the permutation bootstrap fails to control the Type I error as temporal dependence between samples increases, since it does not sample from the correct null distribution.

4.3 STRONG PAIRWISE INTERACTION

This experiment demonstrates a limitation of the Lancaster test. When pairwise interaction is strong, 3-way HSIC has greater test power than Lancaster. Synthetic data were generated from autoregressive processes X, Y and Z according to:

    X_t = (1/2) X_{t−1} + ε_t
    Y_t = (1/2) Y_{t−1} + η_t
    Z_t = (1/2) Z_{t−1} + d (X_t + Y_t) + ζ_t

where X_0, Y_0, Z_0, ε_t, η_t and ζ_t are i.i.d. N(0, 1) random variables and d ∈ R, called the dependence coefficient, determines the extent to which the process (Z_t)_t is dependent on X_t and Y_t.

Data were generated with varying values for the dependence coefficient. For each value of d, 300 datasets were generated, each consisting of 1200 consecutive observations of the variables. Gaussian kernels with bandwidth parameter 1 were used on each variable, and 250 bootstrapping procedures were used for each test on each dataset.

In this case Z_t is pairwise-dependent on both of X_t and Y_t, in addition to all three variables being jointly dependent. Both the Lancaster and 3-way HSIC tests should be capable of detecting the dependence and therefore reject the null hypothesis in the limit of infinite data. The results of this experiment are presented in Figure 3, which demonstrates that in this case the 3-way HSIC test is more sensitive to the dependence than the Lancaster test.

[Figure 3: power of HSIC and Lancaster joint factorisation tests against the dependence coefficient.] Figure 3: Results of experiment in Section 4.3. (S) refers to the simple multiple testing correction; (HB) refers to Holm-Bonferroni. The Lancaster test is less sensitive to dependence than 3-way HSIC, and test power in both cases is higher when using the simple correction rather than the Holm-Bonferroni multiple testing correction.

4.4 FOREX DATA

Exchange rates between three currencies (GBP, USD, EUR) at 5 minute intervals over 7 consecutive trading days were obtained. The data were processed by taking the returns (differences between consecutive terms within each time series, x_t^r = x_t − x_{t−1}), which were then normalised (divided by their standard deviation). We performed the Lancaster test, 3-way HSIC and pairwise HSIC using the first 800 entries of each processed series. All tests rejected the null hypothesis. The Lancaster and 3-way HSIC tests both returned p-values of 0 for each of H_X, H_Y and H_Z with 10000 bootstrapping procedures.

We then shifted one of the time series and repeated the tests (i.e. we used entries 1 to 800 of two of the processed series and entries 801 to 1600 of the third). In this case, pairwise HSIC still detected dependence between the two unshifted time series, and both Lancaster and 3-way HSIC did not reject the null hypothesis that the joint distribution factorises. The Lancaster test returned p-values of 0.2708, 0.2725 and 0.1975 for H_X, H_Y and H_Z respectively. 3-way HSIC returned p-values of 0.3133, 0.0000 and 0.0000 respectively.

In both cases, the Lancaster test behaves as expected. Due to arbitrage, any two exchange rates should determine the third, and the Lancaster test correctly identifies a joint dependence in the returns. However, when we shift one of the time series, we break the dependence between it and the other series. Lancaster correctly identifies here that the underlying distribution does factorise.
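The autoregressive constructions used in these experiments are straightforward to reproduce. Below is a minimal sketch (our own code, not the authors') of the Section 4.1 generator, in which Z depends on X and Y only through the joint term d·|θ_t|·sign(X_t Y_t); the defaults mirror the text (1200 observations, AR coefficient 1/2), and the function name is our own.

```python
import numpy as np

def generate_dataset(n=1200, d=0.1, rng=None):
    """AR(1) triple from Section 4.1: X and Y evolve independently, and Z
    receives the joint signal d * |theta_t| * sign(X_t * Y_t)."""
    rng = np.random.default_rng() if rng is None else rng
    X, Y, Z = np.empty(n), np.empty(n), np.empty(n)
    x, y, z = rng.standard_normal(3)  # X_0, Y_0, Z_0
    for t in range(n):
        x = 0.5 * x + rng.standard_normal()                      # X_t
        y = 0.5 * y + rng.standard_normal()                      # Y_t
        z = (0.5 * z
             + d * abs(rng.standard_normal()) * np.sign(x * y)   # joint term
             + rng.standard_normal())                            # Z_t
        X[t], Y[t], Z[t] = x, y, z
    return X, Y, Z
```

Since E[X_t sign(X_t Y_t)] = E[X_t sign(X_t)] E[sign(Y_t)] = 0, the pairs (X, Z) and (Y, Z) remain uncorrelated for every d, while sign(X_t Y_t) is visibly correlated with Z_t; this is exactly the weak-pairwise, strong-joint regime in which the Lancaster test is reported to outperform 3-way HSIC.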

K˜ij 5 DISCUSSION AND FUTURE 1 X 1 X = hφ (X ) − φ (X ), φ (X ) − φ (X )i RESEARCH X i n X k X j n X k k k 1 X 1 X = hφ¯ (X ) − φ¯ (X ), φ¯ (X ) − φ¯ (X )i We demonstrated that the Lancaster test is more sensitive X i n X k X j n X k than 3-way HSIC when pairwise interaction is weak, but k k 1 X 1 X 1 X that the opposite is true when pairwise interaction is strong. = K¯ − K¯ − K¯ + K¯ ij n ik n jk n2 kl It is curious that the two tests have different strengths in k k kl this manner, particularly when considering the very similar forms of the statistics in each case. Indeed, to test H using Z and expanding L˜ and M˜ in a similar way, we can rewrite the Lancaster statistic, we bootstrap the following: the Lancaster test statistic as

2 1 2 nkµˆLk = (K¯ ◦ L¯ ◦ M¯ )++ − ((K¯ ◦ L¯)M¯ )++ ! n n2 ˆ 2 1 ^˜ ˜ ˜ 2 2 nk∆LP k = K ◦ L ◦ M − ((K¯ ◦ M¯ )L¯) − ((M¯ ◦ L¯)K¯ ) n 2 ++ 2 ++ ++ n n 1 1 + (K¯ ◦ L¯) M¯ + (K¯ ◦ M¯ ) L¯ n3 ++ ++ n3 ++ ++ while for the 3-way HSIC test we bootstrap: 1 2 + (L¯ ◦ M¯ ) K¯ + (M¯ K¯ L¯) n3 ++ ++ n3 ++ 2 ¯ ¯ ¯ 2 ¯ ¯ ¯ + 3 (KLM)++ + 3 (KML)++ 1   n n nHSICb = (K^◦ L) ◦ M˜ 4 4 n ++ + tr(K¯ ◦ L¯ ◦ M¯ ) − (K¯ L¯) M¯ n3 + + + n4 ++ ++ 4 4 − (K¯ M¯ ) L¯ − (L¯M¯ ) K¯ These two quantities differ only in the centring of K and L, n4 ++ ++ n4 ++ ++ amounting to constant shifts in the respective feature spaces 4 + K¯ L¯ M¯ of the kernels k and l. This difference has the consequence n5 ++ ++ ++ of quite drastically changing the types of dependency to which each statistic is sensitive. A formal characterisation ¯ ¯ ¯ of the cases in which the Lancaster statistic is more sensi- We denote by CXYZ = EXYZ [φX (X)⊗φY (Y )⊗φZ (Z)] tive than 3-way HSIC would be desirable. the population centred covariance operator with empirical ¯ 1 P ¯ ¯ ¯ estimate CXYZ = n i φX (Xi) ⊗ φY (Yi) ⊗ φZ (Zi). We define similarly the quantities CXY ,CYZX ,... with cor- 6 PROOFS responding empirical counterparts C¯XY , C¯YZX ,... where ¯ ¯ for example CYZ = EYZ [φY (Y ) ⊗ φZ (Z)] 2 An outline of the proof of Theorem1 was given in Sec- Each of the terms in the above expression for nkµˆLk can tion2; here we provide the full proof, as well as a proof of be expressed as inner products between empirical estimates Theorem2. of population centred covariance operators and tensor prod- ucts of mean embeddings. Rewriting them as such yields:

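The recentring step can be checked directly in a finite-dimensional feature space. A minimal sketch (assuming NumPy and explicit feature vectors standing in for the RKHS maps $\phi_X$): empirical centring of the Gram matrix is unaffected by subtracting an arbitrary constant $\mu$ from the features, and it agrees with the four-term expansion of $\bar K$ above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 20, 5
Phi = rng.standard_normal((n, d))      # feature vectors φ(X_i)
mu = rng.standard_normal(d)            # an arbitrary constant shift, e.g. μ_X

K = Phi @ Phi.T                        # uncentred Gram matrix
Phib = Phi - mu                        # shifted features  φ̄(X_i) = φ(X_i) − μ
Kb = Phib @ Phib.T                     # shifted Gram matrix K̄

H = np.eye(n) - np.ones((n, n)) / n    # empirical centring matrix
Kt = H @ K @ H                         # empirically centred K~

# K~ is unchanged if the constant μ is first subtracted from the features...
assert np.allclose(Kt, H @ Kb @ H)
# ...and equals the four-term expansion of K̄ from the proof
expand = (Kb - Kb.mean(axis=1, keepdims=True)
             - Kb.mean(axis=0, keepdims=True) + Kb.mean())
assert np.allclose(Kt, expand)
```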
Expanding $\tilde L$ and $\tilde M$ in a similar way, we can rewrite the Lancaster test statistic as

\[
\begin{aligned}
n\|\hat\mu_L\|^2 ={}& \frac{1}{n}(\bar K \circ \bar L \circ \bar M)_{++} \\
&- \frac{2}{n^2}\big((\bar K \circ \bar L)\bar M\big)_{++} - \frac{2}{n^2}\big((\bar K \circ \bar M)\bar L\big)_{++} - \frac{2}{n^2}\big((\bar M \circ \bar L)\bar K\big)_{++} \\
&+ \frac{1}{n^3}(\bar K \circ \bar L)_{++}\bar M_{++} + \frac{1}{n^3}(\bar K \circ \bar M)_{++}\bar L_{++} + \frac{1}{n^3}(\bar L \circ \bar M)_{++}\bar K_{++} \\
&+ \frac{2}{n^3}(\bar M \bar K \bar L)_{++} + \frac{2}{n^3}(\bar K \bar L \bar M)_{++} + \frac{2}{n^3}(\bar K \bar M \bar L)_{++} + \frac{4}{n^3}\sum_i (\bar K \mathbf{1})_i (\bar L \mathbf{1})_i (\bar M \mathbf{1})_i \\
&- \frac{4}{n^4}(\bar K \bar L)_{++}\bar M_{++} - \frac{4}{n^4}(\bar K \bar M)_{++}\bar L_{++} - \frac{4}{n^4}(\bar L \bar M)_{++}\bar K_{++} + \frac{4}{n^5}\bar K_{++}\bar L_{++}\bar M_{++}.
\end{aligned}
\]

We denote by $C_{XYZ} = \mathbb{E}_{XYZ}\big[\bar\phi_X(X)\otimes\bar\phi_Y(Y)\otimes\bar\phi_Z(Z)\big]$ the population centred covariance operator, with empirical estimate $\bar C_{XYZ} = \frac{1}{n}\sum_i \bar\phi_X(X_i)\otimes\bar\phi_Y(Y_i)\otimes\bar\phi_Z(Z_i)$. We define similarly the quantities $C_{XY}, C_{YZX}, \dots$ with corresponding empirical counterparts $\bar C_{XY}, \bar C_{YZX}, \dots$, where for example $C_{YZ} = \mathbb{E}_{YZ}\big[\bar\phi_Y(Y)\otimes\bar\phi_Z(Z)\big]$.

Each of the terms in the above expression for $n\|\hat\mu_L\|^2$ can be expressed as inner products between empirical estimates of population centred covariance operators and tensor products of mean embeddings. Rewriting them as such yields:

\[
\begin{aligned}
n\|\hat\mu_L\|^2 ={}& n\langle \bar C_{XYZ}, \bar C_{XYZ}\rangle \\
&- 2n\langle \bar C_{XYZ}, \bar C_{XY}\otimes\bar\mu_Z\rangle - 2n\langle \bar C_{XZY}, \bar C_{XZ}\otimes\bar\mu_Y\rangle - 2n\langle \bar C_{YZX}, \bar C_{YZ}\otimes\bar\mu_X\rangle \\
&+ n\langle \bar C_{XY}\otimes\bar\mu_Z, \bar C_{XY}\otimes\bar\mu_Z\rangle + n\langle \bar C_{XZ}\otimes\bar\mu_Y, \bar C_{XZ}\otimes\bar\mu_Y\rangle + n\langle \bar C_{YZ}\otimes\bar\mu_X, \bar C_{YZ}\otimes\bar\mu_X\rangle \\
&+ 2n\langle \bar\mu_Z\otimes\bar C_{XY}, \bar C_{ZX}\otimes\bar\mu_Y\rangle \\
&\;\;\vdots \\
&+ 2n\langle \bar\mu_X\otimes\bar C_{YZ}, \bar C_{XY}\otimes\bar\mu_Z\rangle + 2n\langle \bar\mu_X\otimes\bar C_{ZY}, \bar C_{XZ}\otimes\bar\mu_Y\rangle \\
&+ 4n\langle \bar C_{XYZ}, \bar\mu_X\otimes\bar\mu_Y\otimes\bar\mu_Z\rangle \\
&- 4n\langle \bar C_{XY}\otimes\bar\mu_Z, \bar\mu_X\otimes\bar\mu_Y\otimes\bar\mu_Z\rangle
\end{aligned}
\]

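The long expansion above is a purely algebraic identity between the doubly centred Gram matrices and the $\bar K, \bar L, \bar M$ terms, so it can be spot-checked numerically on arbitrary symmetric matrices. A sketch assuming NumPy, with `s(A)` denoting $A_{++}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 7
sym = lambda A: (A + A.T) / 2
# arbitrary symmetric matrices standing in for K̄, L̄, M̄
Kb, Lb, Mb = (sym(rng.standard_normal((n, n))) for _ in range(3))

H = np.eye(n) - np.ones((n, n)) / n
Kt, Lt, Mt = H @ Kb @ H, H @ Lb @ H, H @ Mb @ H
lhs = (Kt * Lt * Mt).sum() / n                     # (1/n)(K~∘L~∘M~)_++

s = lambda A: A.sum()                              # A_++
one = np.ones(n)
rhs = ((Kb * Lb * Mb).sum() / n
       - 2 / n**2 * (s((Kb * Lb) @ Mb) + s((Kb * Mb) @ Lb) + s((Mb * Lb) @ Kb))
       + 1 / n**3 * (s(Kb * Lb) * s(Mb) + s(Kb * Mb) * s(Lb) + s(Lb * Mb) * s(Kb))
       + 2 / n**3 * (s(Mb @ Kb @ Lb) + s(Kb @ Lb @ Mb) + s(Kb @ Mb @ Lb))
       + 4 / n**3 * ((Kb @ one) * (Lb @ one) * (Mb @ one)).sum()
       - 4 / n**4 * (s(Kb @ Lb) * s(Mb) + s(Kb @ Mb) * s(Lb) + s(Lb @ Mb) * s(Kb))
       + 4 / n**5 * s(Kb) * s(Lb) * s(Mb))

assert np.isclose(lhs, rhs)
```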
\[
\begin{aligned}
\phantom{n\|\hat\mu_L\|^2 ={}}&- 4n\langle \bar C_{XZ}\otimes\bar\mu_Y, \bar\mu_X\otimes\bar\mu_Z\otimes\bar\mu_Y\rangle - 4n\langle \bar C_{YZ}\otimes\bar\mu_X, \bar\mu_Y\otimes\bar\mu_Z\otimes\bar\mu_X\rangle \\
&+ 4n\langle \bar\mu_X\otimes\bar\mu_Y\otimes\bar\mu_Z, \bar\mu_X\otimes\bar\mu_Y\otimes\bar\mu_Z\rangle.
\end{aligned}
\]

By assumption, $P_{XYZ} = P_{XY}P_Z$, and thus the expectation operator also factorises similarly. As a consequence, $C_{XYZ} = 0$. Indeed, given any $A \in \mathcal{F}_X \otimes \mathcal{F}_Y \otimes \mathcal{F}_Z$, we can consider $A$ to be a bounded linear operator $\mathcal{F}_Z \to \mathcal{F}_X \otimes \mathcal{F}_Y$. It follows that⁴

\[
\begin{aligned}
\mathbb{E}_{XYZ}\langle A, \bar C_{XYZ}\rangle &= \frac{1}{n}\sum_i \mathbb{E}_{XY}\mathbb{E}_Z \big\langle A, \bar\phi_X(X_i)\otimes\bar\phi_Y(Y_i)\otimes\bar\phi_Z(Z_i)\big\rangle \\
&= \frac{1}{n}\sum_i \mathbb{E}_{XY}\mathbb{E}_Z \big\langle \bar\phi_X(X_i)\otimes\bar\phi_Y(Y_i),\, A\bar\phi_Z(Z_i)\big\rangle_{\mathcal{F}_X\otimes\mathcal{F}_Y} \\
&= \frac{1}{n}\sum_i \mathbb{E}_{XY}\big\langle \bar\phi_X(X_i)\otimes\bar\phi_Y(Y_i),\, A\,\mathbb{E}_Z\bar\phi_Z(Z_i)\big\rangle_{\mathcal{F}_X\otimes\mathcal{F}_Y} \\
&= 0.
\end{aligned}
\]

We conclude that $C_{XYZ} = \mathbb{E}_{XYZ}\bar C_{XYZ} = 0$. Similarly, $C_{XZY}, C_{YZX}, C_{XZ}, C_{YZ}$ are all 0 in their respective Hilbert spaces. Lemma 2 tells us that each sub-process of $(X_i, Y_i, Z_i)$ satisfies the same β-mixing conditions as $(X_i, Y_i, Z_i)$, thus by applying Lemma 1 it follows that $\|\bar C_{XZY}\|, \|\bar C_{YZX}\|, \|\bar C_{XZ}\|, \|\bar C_{YZ}\|, \|\bar\mu_X\|, \|\bar\mu_Y\|, \|\bar\mu_Z\| = O_P\big(n^{-1/2}\big)$. Therefore

\[
n\|\hat\mu_L\|^2 \xrightarrow{\,O_P(n^{-1/2})\,} n\langle \bar C_{XYZ}, \bar C_{XYZ}\rangle - 2n\langle \bar C_{XYZ}, \bar C_{XY}\otimes\bar\mu_Z\rangle + n\langle \bar C_{XY}\otimes\bar\mu_Z, \bar C_{XY}\otimes\bar\mu_Z\rangle = \frac{1}{n}\big((\bar K\circ\bar L)\circ\bar M\big)_{++} - \frac{2}{n^2}\big((\bar K\circ\bar L)\bar M\big)_{++} + \frac{1}{n^3}(\bar K\circ\bar L)_{++}\bar M_{++},
\]

since all the other terms decay at least as quickly as $O_P(1/\sqrt n)$. This is shown here for $n\langle \bar\mu_X\otimes\bar C_{YZ}, \bar C_{XY}\otimes\bar\mu_Z\rangle$; the proofs for the other terms are similar:

\[
\begin{aligned}
n\langle \bar\mu_X\otimes\bar C_{YZ}, \bar C_{XY}\otimes\bar\mu_Z\rangle &\le n\|\bar\mu_X\otimes\bar C_{YZ}\|\,\|\bar C_{XY}\otimes\bar\mu_Z\| \\
&= n\sqrt{\langle\bar\mu_X,\bar\mu_X\rangle\langle\bar C_{YZ},\bar C_{YZ}\rangle}\,\sqrt{\langle\bar C_{XY},\bar C_{XY}\rangle\langle\bar\mu_Z,\bar\mu_Z\rangle} \\
&= n\|\bar\mu_X\|\,\|\bar C_{YZ}\|\,\|\bar C_{XY}\|\,\|\bar\mu_Z\| \\
&= n\,O_P\Big(\tfrac{1}{\sqrt n}\Big)O_P\Big(\tfrac{1}{\sqrt n}\Big)O_P(1)O_P\Big(\tfrac{1}{\sqrt n}\Big) = O_P\Big(\tfrac{1}{\sqrt n}\Big).
\end{aligned}
\]

It can be shown that $\bar K \circ \bar L$ in the above expression can be replaced with $\overline{\bar K \circ \bar L}$, the Gram matrix of the recentred product kernel, while preserving equality. That is, we can equivalently write

\[
n\|\widehat{\Delta_L P}\|^2 \xrightarrow{\,O_P(n^{-1/2})\,} \frac{1}{n}\big(\overline{\bar K\circ\bar L}\circ\bar M\big)_{++} - \frac{2}{n^2}\big(\overline{\bar K\circ\bar L}\,\bar M\big)_{++} + \frac{1}{n^3}\big(\overline{\bar K\circ\bar L}\big)_{++}\bar M_{++}.
\]

This is equivalent to treating $\bar k \otimes \bar l$ as a kernel on the single variable $T := (X, Y)$ and performing another recentring trick as we did at the beginning of this proof.

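The claim that $\bar K \circ \bar L$ may be recentred without changing the value of the three-term expression is pure matrix algebra: recentring the product kernel changes $P := \bar K \circ \bar L$ only by terms of the form $u\mathbf{1}^\top + \mathbf{1}u^\top$, which the expression annihilates. A numeric spot-check (a sketch assuming NumPy; `f` is the three-term expression):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
sym = lambda A: (A + A.T) / 2
P = sym(rng.standard_normal((n, n)))   # stands in for K̄∘L̄
M = sym(rng.standard_normal((n, n)))   # stands in for M̄

def f(P, M, n):
    """(1/n)(P∘M)_++ − (2/n²)(PM)_++ + (1/n³) P_++ M_++"""
    return ((P * M).sum() / n - 2 * (P @ M).sum() / n**2
            + P.sum() * M.sum() / n**3)

# recentring shifts P by u1ᵀ + 1uᵀ for some vector u; f does not change
u = rng.standard_normal(n)
shift = np.outer(u, np.ones(n)) + np.outer(np.ones(n), u)
assert np.isclose(f(P, M, n), f(P + shift, M, n))
```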
By rewriting the above expression in terms of the operator $\bar C_{TZ}$ and the mean embeddings $\bar\mu_T$ and $\bar\mu_Z$, it can be shown by a similar argument to before that the latter two terms tend to 0 at least as $O_P\big(n^{-1/2}\big)$, and thus, substituting for the definition of $\|\hat\mu^{(Z)}_{L,2}\|^2$,

\[ n\|\hat\mu_L\|^2 \xrightarrow{\,O_P(1/\sqrt n)\,} n\|\hat\mu^{(Z)}_{L,2}\|^2 \]

as required. □

Proof. (Theorem 2) Note that $\mathbb{E}_{XYZ} = \mathbb{E}_{XY}\mathbb{E}_Z$ under $H_Z$. Therefore, fixing any $s_j = (x_j, y_j, z_j)$, we have that

\[
\begin{aligned}
\mathbb{E}_{S_i} h(S_i, s_j) &= \mathbb{E}_{X_iY_i}\mathbb{E}_{Z_i}\, \bar k \otimes \bar l \otimes \bar m\,(S_i, s_j) \\
&= \mathbb{E}_{X_iY_i}\big\langle \bar\phi(X_i)\otimes\bar\phi(Y_i) - C_{XY},\; \bar\phi(x_j)\otimes\bar\phi(y_j) - C_{XY} \big\rangle \times \big\langle \mathbb{E}_{Z_i}\bar\phi(Z_i), \bar\phi(z_j)\big\rangle \\
&= \big\langle 0,\; \bar\phi(x_j)\otimes\bar\phi(y_j) - C_{XY}\big\rangle \times \big\langle 0, \bar\phi(z_j)\big\rangle = 0.
\end{aligned}
\]

Therefore $h$ is degenerate. Symmetry follows from the symmetry of the Hilbert space inner product.
For boundedness and Lipschitz continuity, it suffices to show that the two following rules for constructing new kernels from old preserve both properties (see Supplementary Materials A.5 for proof):

• $k \mapsto \bar k$
• $(k, l) \mapsto k \otimes l$

It then follows that $h = \bar k \otimes \bar l \otimes \bar m$ is bounded and Lipschitz continuous, since it can be constructed from $k$, $l$ and $m$ using the two above rules. □

⁴We can bring the $\mathbb{E}_Z$ inside the inner product in the penultimate line due to the Bochner integrability of $\bar\phi_Z(Z)$, which follows from the conditions required for $\mu_Z$ to exist [Steinwart and Christmann, 2008].

References

M. Besserve, N. Logothetis, and B. Schölkopf. Statistical analysis of coupled time series with kernel cross-spectral density operators. In NIPS, pages 2535–2543, 2013.

R. C. Bradley et al. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2(2):107–144, 2005.

K. Chwialkowski and A. Gretton. A kernel independence test for random processes. arXiv preprint arXiv:1402.4501, 2014.

K. P. Chwialkowski, D. Sejdinovic, and A. Gretton. A wild bootstrap for degenerate kernel tests. In Advances in Neural Information Processing Systems, pages 3608–3616, 2014.

J. Dedecker and C. Prieur. New dependence coefficients. Examples and applications to statistics. Probability Theory and Related Fields, 132(2):203–236, 2005.

A. Kankainen and N. G. Ushakov. A consistent modification of a test for independence based on the empirical characteristic function. Journal of Mathematical Sciences, 89:1582–1589, 1998.

G. Kirchgässner, J. Wolters, and U. Hassler. Introduction to Modern Time Series Analysis. Springer Science & Business Media, 2012.

H. O. Lancaster. Chi-Square Distribution. Wiley Online Library, 1969.

A. W. Ledford and J. A. Tawn. Statistics for near independence in multivariate extreme values. Biometrika, 83(1):169–187, 1996.

A. Leucht and M. H. Neumann. Dependent wild bootstrap for degenerate U- and V-statistics. Journal of Multivariate Analysis, 117:257–280, 2013.

R. Patra, B. Sen, and G. Szekely. On a nonparametric notion of residual and its applications. Statistics & Probability Letters, 106:208–213, 2015.

J. Dedecker, P. Doukhan, G. Lang, L. R. J. Rafael, S. Louhichi, and C. Prieur.
Weak dependence. In Weak Dependence: With Examples and Applications, pages 9–20. Springer, 2007.

H. Dehling, O. S. Sharipov, and M. Wendler. Bootstrap for dependent Hilbert space-valued random variables with application to von Mises statistics. Journal of Multivariate Analysis, 133:200–215, 2015.

P. Doukhan. Mixing. Springer, 1994.

A. Feuerverger. A consistent test for bivariate dependence. International Statistical Review, 61(3):419–433, 1993.

K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In NIPS, pages 489–496, Cambridge, MA, 2008. MIT Press.

S. Gaisser, M. Ruppert, and F. Schmid. A multivariate version of Hoeffding's phi-square. Journal of Multivariate Analysis, 101(10):2571–2586, 2010.

A. Gretton and L. Gyorfi. Consistent nonparametric tests of independence. Journal of Machine Learning Research, 11:1391–1423, 2010.

A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic Learning Theory, pages 63–77. Springer, 2005.

A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems, pages 585–592, 2007.

R. Heller, Y. Heller, and M. Gorfine. A consistent multivariate test of association based on ranks of distances. Biometrika, 100(2):503–510, 2013.

S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, pages 65–70, 1979.

D. Sejdinovic, A. Gretton, and W. Bergsma. A kernel test for three-variable interactions. In Advances in Neural Information Processing Systems, pages 1124–1132, 2013.

R. J. Serfling. Approximation Theorems of Mathematical Statistics, volume 162. John Wiley & Sons, 2009.

X. Shao. The dependent wild bootstrap. Journal of the American Statistical Association, 105(489):218–235, 2010.

A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Algorithmic Learning Theory, pages 13–31. Springer, 2007.

B. K. Sriperumbudur, K. Fukumizu, and G. R. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12:2389–2410, 2011.

I. Steinwart and A. Christmann. Support Vector Machines. Springer Science & Business Media, 2008.

G. Székely, M. Rizzo, and N. Bakirov. Measuring and testing dependence by correlation of distances. Annals of Statistics, 35(6):2769–2794, 2007.

K. Zhang, J. Peters, D. Janzing, and B. Schölkopf. Kernel-based conditional independence test and application in causal discovery. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), pages 804–813, 2011.