
Nonparametric MANOVA via Independence Testing

Sambit Panda1,2, Cencheng Shen3, Ronan Perry1, Jelle Zorn4, Antoine Lutz4, Carey E. Priebe5 and Joshua T. Vogelstein1,2,6∗

Abstract. The k-sample testing problem tests whether or not k groups of points are sampled from the same distribution. Multivariate analysis of variance (Manova) is currently the gold standard for k-sample testing but makes strong, often inappropriate, parametric assumptions. Moreover, independence testing and k-sample testing are tightly related, and there are many nonparametric multivariate independence tests with strong theoretical and empirical properties, including distance correlation (Dcorr) and Hilbert-Schmidt-Independence-Criterion (Hsic). We prove that universally consistent independence tests achieve universally consistent k-sample testing, and that k-sample statistics like Energy and Maximum Mean Discrepancy (MMD) are exactly equivalent to Dcorr. Empirically evaluating these tests in k-sample scenarios demonstrates that these nonparametric independence tests typically outperform Manova, even in Gaussian distributed settings. Finally, we extend these nonparametric k-sample testing procedures to perform multiway and multilevel tests. Thus, we illustrate the existence of many theoretically motivated and empirically performant k-sample tests. A Python package with all independence and k-sample tests called hyppo is available from https://hyppo.neurodata.io/.

1 Introduction

A fundamental problem in statistics is the k-sample testing problem. Consider the two-sample problem: we obtain two datasets $u_i \in \mathbb{R}^p$ for $i = 1, \ldots, n$ and $v_j \in \mathbb{R}^p$ for $j = 1, \ldots, m$. Assume each $u_i$ is sampled independently and identically (i.i.d.) from $F_U$ and each $v_j$ is sampled i.i.d. from $F_V$ (and also that every $u_i$ is independent of every $v_j$). The two-sample testing problem tests whether the two datasets were sampled from the same distribution, that is,

$$H_0 : F_U = F_V, \qquad H_A : F_U \neq F_V. \tag{1.1}$$

Eq. (1.1) can also be generalized to k samples: let $u_i^j \in \mathbb{R}^p$ for $j = 1, \ldots, k$ and $i = 1, \ldots, n_j$ be k datasets that are sampled i.i.d. from $F_1, \ldots, F_k$ and independently from one another. Then,

$$H_0 : F_1 = F_2 = \cdots = F_k, \qquad H_A : \exists\, j \neq j' \text{ s.t. } F_j \neq F_{j'}. \tag{1.2}$$

To approach the problem of two-sample testing, Student's t-test [1] and its multivariate generalization Hotelling's $T^2$ [2] are traditionally used, while a few nonparametric alternatives that operate well on multivariate, nonlinear data have been proposed, such as Energy [3], maximum mean discrepancy (Mmd) [4], and Heller, Heller, and Gorfine's test [5]. The two-sample testing problem can be generalized to the k-sample testing problem, where analysis of variance (Anova) [6] or its multivariate analogue, multivariate Anova (Manova) [7], can be used, but these statistics either cannot be applied to, or operate poorly on, multivariate and nonlinear data. Also, Anova and Manova in particular rely on fundamental assumptions that are not generally satisfied in real data [8, 9]. Recently, a few nonparametric alternatives to Manova [10, 11] have been proposed, such as multivariate k-sample Heller Heller Gorfine [12] and distance components (Disco) [13]. Nonparametric tests similar to Manova are desirable, especially when the assumptions of Manova are not met [14].

A closely related problem to the k-sample testing problem is the independence testing problem. It is framed as follows: given $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}^q$, consider n samples $(x_i, y_i) \overset{iid}{\sim} F_{XY}$. The two random variables X and Y are independent if and only if $F_{XY} = F_X F_Y$. So, the independence testing problem can be stated as,

$$H_0 : F_{XY} = F_X F_Y, \qquad H_A : F_{XY} \neq F_X F_Y. \tag{1.3}$$

Many correlation measures have been proposed to approach the problem laid out in Eq. (1.3), such as Pearson's correlation [15]. But as with k-sample tests, many are unsuited to detecting nonlinear and high-dimensional dependence structures within data. Recently, several statistics have been proposed that operate well on high-dimensional (potentially non-Euclidean) data, such as distance correlation (Dcorr) [16–19] and Hilbert-Schmidt independence criterion (Hsic) [20–22], which were shown to be equivalent formulations by Sejdinovic et al. [23] and Shen and Vogelstein [24]. Heller, Heller, and Gorfine proposed another nonparametric independence test (Hhg) with particularly high power in certain nonlinear relationships [5]. Multiscale Graph Correlation (Mgc) has demonstrated higher statistical power on many multivariate, nonlinear, and structured data when compared to other independence tests [25–27]. Mgc is statistically efficient, requiring about one-half to one-third as many samples to achieve the same statistical power as other approaches [28]. Furthermore, Kernel Mean Embedding Random Forest (Kmerf), which uses Dcorr and a kernel similarity matrix induced by a random forest, has been shown to have even larger gains in power [29]. For each of these tests, p-values can be calculated using a random permutation test [30–32].

We prove that independence tests can be used for consistent k-sample testing and that the Energy and Mmd methods are equivalent to Dcorr and Hsic, respectively, and we empirically evaluate state-of-the-art independence tests as k-sample tests. When compared to existing k-sample tests over a suite of linear and nonlinear simulations, we demonstrate that these independence tests, and specifically Kmerf and Mgc, have higher statistical power than the alternatives in nearly all settings. All the k-sample tests are provided in the hyppo statistical package [25].

∗Sambit Panda and Cencheng Shen contribute equally to this work. Corresponding author: [email protected]. 1 Department of Biomedical Engineering, Johns Hopkins University; 2 Institute for Computational Medicine, Johns Hopkins University; 3 Department of Applied Economics and Statistics, University of Delaware; 4 Lyon Neuroscience Research Centre, Lyon 1 University; 5 Department of Applied Mathematics and Statistics, Johns Hopkins University; 6 Center for Imaging Science, Kavli Neuroscience Discovery Institute, Johns Hopkins University; Progressive Learning

arXiv:1910.08883v3 [stat.ML] 2 Apr 2021

2 Preliminaries

2.1 Notation

Let $\mathbb{R}$ denote the real line $(-\infty, \infty)$. Let $F_X$, $F_Y$, and $F_{XY}$ refer to the marginal and joint distributions of random variables X and Y, respectively. Let $x_i$ and $y_j$ refer to samples from $F_X$ and $F_Y$, and let $x \in \mathbb{R}^{n \times p}$ and $y \in \mathbb{R}^{m \times p}$ refer to the matrices of observations, that is, $x = \{x_1, \ldots, x_n\}$ and $y = \{y_1, \ldots, y_m\}$. The trace of an $n \times n$ square matrix is the sum of the elements along the main diagonal: $\mathrm{tr}(x) = \sum_{i=1}^{n} x_{ii}$.

Evaluating the performance of tests requires metrics that capture effectiveness across various sample sizes and dimensions. The testing power of a level $\alpha$ test ($\alpha$ is the Type 1 error level) is the probability of correctly rejecting the null hypothesis when the alternative is true. For a test to be consistent, statistical power must converge to unity as the sample size increases to $\infty$.

2.2 Hotelling

Hotelling is a generalization of Student's t-test to arbitrary dimension [2]. Consider input samples $u_i \overset{iid}{\sim} F_U$ for $i \in \{1, \ldots, n\}$ and $v_i \overset{iid}{\sim} F_V$ for $i \in \{1, \ldots, m\}$. Let $\bar{u}$ refer to the columnwise mean of $u$; that is, $\bar{u} = (1/n) \sum_{i=1}^{n} u_i$, and let $\bar{v}$ be the same for $v$. Calculate the sample covariance matrix $\hat{\Sigma}_{uv} = u^T v$ and the sample variance matrices $\hat{\Sigma}_{uu} = u^T u$ and $\hat{\Sigma}_{vv} = v^T v$. Denote the pooled covariance $\hat{\Sigma}$ as

$$\hat{\Sigma} = \frac{(n - 1)\hat{\Sigma}_{uu} + (m - 1)\hat{\Sigma}_{vv}}{n + m - 2}.$$

Then,

$$\mathrm{Hotelling}_{n,m}(u, v) = \frac{nm}{n + m} (\bar{u} - \bar{v})^T \hat{\Sigma}^{-1} (\bar{u} - \bar{v}). \tag{2.1}$$

Since it is a multivariate generalization of Student's t-test, it relies on some of the same assumptions. That is, the validity of Hotelling, like that of Manova, depends on the assumption that the random variables are normally distributed within each group, each with the same covariance matrix. Distributions of input data are generally not known and cannot always be reasonably modeled as Gaussian [33, 34], and equal covariance across groups is also generally not true of real data.

2.3 Manova

Manova is a procedure for comparing more than two multivariate samples [8, 35]. It is a multivariate generalization of Anova [8] that uses covariance matrices rather than scalar variances. As in Rencher [36]: consider input samples $x_1, x_2, \ldots, x_k$ that have the same dimensionality p. Each $x_i$, where $i \in \{1, \ldots, k\}$, is assumed to be sampled from a multivariate distribution $N(\mu_i, \Sigma)$, so each sample is assumed to have the same covariance matrix $\Sigma$. The model for each p-dimensional vector of each $x_i$ is defined as follows: for $j \in \{1, \ldots, n_i\}$,

$$x_{ij} = \mu_i + \epsilon_{ij}.$$

In Manova, we are testing whether the mean vectors of the k samples are the same. That is, the null and alternative hypotheses are,

$$H_0 : \mu_1 = \mu_2 = \cdots = \mu_k, \qquad H_A : \exists\, j \neq j' \text{ s.t. } \mu_j \neq \mu_{j'}.$$

Let $\bar{x}_{i\cdot}$ refer to the columnwise means of $x_i$; that is, $\bar{x}_{i\cdot} = (1/n_i) \sum_{j=1}^{n_i} x_{ij}$. The pooled sample covariance of each group, W, is

$$W = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i\cdot})(x_{ij} - \bar{x}_{i\cdot})^T. \tag{2.2}$$

Next, define B as the sample covariance matrix of the means. If $n = \sum_{i=1}^{k} n_i$ and the grand mean is $\bar{x}_{\cdot\cdot} = (1/n) \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij}$,

$$B = \sum_{i=1}^{k} n_i (\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot})(\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot})^T. \tag{2.3}$$

Some of the most common statistics used when performing Manova include Wilks' Lambda, the Lawley-Hotelling trace, Roy's greatest root, and the Pillai-Bartlett trace (PBT) [37–39]. PBT is recognized to be the best of these as it is the most conservative [8, 40], and Olson [41] has shown that there are minimal differences in statistical power among these statistics. Let $\lambda_1, \lambda_2, \ldots, \lambda_s$ refer to the eigenvalues of $W^{-1} B$. Here $s = \min(\nu_B, p)$ is the minimum of the degrees of freedom of B, $\nu_B$, and p. So, the PBT Manova test statistic can be written as [35],

$$\mathrm{Manova}_{n_1, \ldots, n_k}(x, y) = \sum_{i=1}^{s} \frac{\lambda_i}{1 + \lambda_i} = \mathrm{tr}\left(B (B + W)^{-1}\right). \tag{2.4}$$

Manova is closely related to Hotelling, and as such, it relies on the same assumptions that Hotelling does.

2.4 Independence Tests

Here we highlight a few independence tests; for more details see Panda et al. [25].

Dcorr. Dcorr is a powerful test to determine linear and nonlinear associations between two random variables or vectors in arbitrary dimensions. The test statistic can be determined as follows: let $D^x$ be the $n \times n$ distance matrix of x and $D^y$ be the $n \times n$ distance matrix of y. Let $H = I - \frac{1}{n} J$ denote the $n \times n$ centering matrix, where I is the identity matrix and J is the matrix of ones. The distance covariance (Dcov) and distance correlation (Dcorr) can then be defined as [18],

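$$\mathrm{Dcov}_n(x, y) = \frac{1}{n^2} \mathrm{tr}(H D^x H H D^y H). \tag{2.5}$$

$$\mathrm{Dcorr}_n(x, y) = \frac{\mathrm{Dcov}_n(x, y)}{\sqrt{\mathrm{Dcov}_n(x, x) \cdot \mathrm{Dcov}_n(y, y)}} \in [-1, 1]. \tag{2.6}$$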
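As an illustration of Eqs. (2.5) and (2.6), the following is a minimal NumPy sketch, not the hyppo implementation. It assumes Euclidean distances; the helper names `pairwise_euclidean`, `dcov_biased`, and `dcorr_biased` are hypothetical and chosen only for this example.

```python
import numpy as np


def pairwise_euclidean(z):
    """Return the n x n matrix of Euclidean distances between the rows of z."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def dcov_biased(x, y):
    """Biased distance covariance, Eq. (2.5): tr(H Dx H H Dy H) / n^2."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix H = I - J / n
    A = H @ pairwise_euclidean(x) @ H
    B = H @ pairwise_euclidean(y) @ H
    return np.trace(A @ B) / n ** 2


def dcorr_biased(x, y):
    """Biased distance correlation, Eq. (2.6)."""
    denom = np.sqrt(dcov_biased(x, x) * dcov_biased(y, y))
    return dcov_biased(x, y) / denom if denom > 0 else 0.0


rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
y_dependent = x ** 2                       # nonlinear function of x
y_independent = rng.normal(size=(50, 2))   # drawn independently of x

print("dependent:  ", dcorr_biased(x, y_dependent))
print("independent:", dcorr_biased(x, y_independent))
```

The statistic for the nonlinearly dependent pair should be clearly larger than for the independent pair, although the biased statistic does not concentrate around zero under independence at this sample size.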

The statistics presented in equations (2.5) and (2.6) are biased; fortunately, unbiased distance correlation test statistics have also been developed [42]. Define a modified matrix $C^x$ such that,

$$C^x_{ij} = \begin{cases} D^x_{ij} - \dfrac{1}{n-2} \sum_{t=1}^{n} D^x_{it} - \dfrac{1}{n-2} \sum_{t=1}^{n} D^x_{tj} + \dfrac{1}{(n-1)(n-2)} \sum_{s,t=1}^{n} D^x_{st} & i \neq j, \\ 0 & \text{otherwise}, \end{cases}$$

and define $C^y$ similarly. Then, the unbiased distance covariance (Dcov) and unbiased distance correlation (Dcorr) are [42],

$$\mathrm{Dcov}_n(x, y) = \frac{1}{n(n - 3)} \mathrm{tr}(C^x C^y). \tag{2.7}$$

$$\mathrm{Dcorr}_n(x, y) = \frac{\mathrm{Dcov}_n(x, y)}{\sqrt{\mathrm{Dcov}_n(x, x) \cdot \mathrm{Dcov}_n(y, y)}} \in [-1, 1]. \tag{2.8}$$
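The unbiased statistics in Eqs. (2.7) and (2.8) can be sketched in the same way. This is an illustrative NumPy version under the assumption of Euclidean distances and n > 3; the function names are hypothetical and it is not the hyppo code. The final lines average the unbiased Dcov over many independent draws to show that it is centered near zero under independence.

```python
import numpy as np


def pairwise_euclidean(z):
    """Return the n x n matrix of Euclidean distances between the rows of z."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def u_center(D):
    """Modified (U-centered) matrix C with zero diagonal, as defined before Eq. (2.7)."""
    n = D.shape[0]
    row = D.sum(axis=1, keepdims=True)   # sum over t of D[i, t]
    col = D.sum(axis=0, keepdims=True)   # sum over t of D[t, j]
    C = D - row / (n - 2) - col / (n - 2) + D.sum() / ((n - 1) * (n - 2))
    np.fill_diagonal(C, 0.0)
    return C


def dcov_unbiased(x, y):
    """Unbiased distance covariance, Eq. (2.7); requires n > 3."""
    Cx = u_center(pairwise_euclidean(x))
    Cy = u_center(pairwise_euclidean(y))
    n = Cx.shape[0]
    return np.trace(Cx @ Cy) / (n * (n - 3))


def dcorr_unbiased(x, y):
    """Unbiased distance correlation, Eq. (2.8); can be slightly negative."""
    denom = np.sqrt(dcov_unbiased(x, x) * dcov_unbiased(y, y))
    return dcov_unbiased(x, y) / denom if denom > 0 else 0.0


rng = np.random.default_rng(0)
null_values = [
    dcov_unbiased(rng.normal(size=(20, 1)), rng.normal(size=(20, 1)))
    for _ in range(300)
]
print("mean unbiased Dcov under independence:", np.mean(null_values))
```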

Since the statistics presented in equations (2.7) and (2.8) provide similar empirical results to the biased statistics [28], from now on any reference to distance correlation will refer to the unbiased distance correlation.

Hsic. Hilbert-Schmidt independence criterion (Hsic) is a closely related test that exchanges the distance matrices $D^x$ and $D^y$ for kernel similarity matrices $K^x$ and $K^y$. They are exactly equivalent in the sense that every valid kernel has a corresponding valid semimetric to ensure their equivalence, and vice versa [23, 24]. In other words, every Dcorr test is also an Hsic test and vice versa. Nonetheless, implementations of Dcorr and Hsic use different metrics by default: Dcorr uses the Euclidean distance while Hsic uses a Gaussian kernel similarity.

Mgc. Building upon the ideas of Dcorr, Hsic, and k-nearest neighbors, Mgc preserves the consistency property while typically working better in multivariate and non-monotonic relationships [28]. The Mgc test statistic is computed as follows:

1. Two distance matrices $D^x$ and $D^y$ are computed and modified to be mean zero columnwise. This results in two $n \times n$ distance matrices $C^x$ and $C^y$ (the centering and unbiased modification is slightly different from the unbiased modification in the previous section; see [26] for more details).
2. For all values k and l from $1, \ldots, n$,
   (a) The k-nearest neighbor and l-nearest neighbor graphs are calculated for each property. Here, $G_k(i, j)$ has value 1 for the k smallest values of the i-th row of $D^x$ and $H_l(i, j)$ has value 1 for the l smallest values of the i-th row of $D^y$. All other values in both matrices are 0.
   (b) The local correlations are summed and normalized using the following statistic:

$$c^{kl} = \frac{\sum_{ij} D^x(i, j) G_k(i, j) D^y(i, j) H_l(i, j)}{\sqrt{\sum_{ij} (D^x(i, j))^2 G_k(i, j)} \cdot \sqrt{\sum_{ij} (D^y(i, j))^2 H_l(i, j)}},$$

3. The Mgc test statistic is the smoothed optimal local correlation of $\{c^{kl}\}$. Denote the smoothing operation as $R(\cdot)$ (which essentially sets all isolated large correlations to 0 and leaves connected large correlations unchanged; see [26]). Mgc is

$$\mathrm{Mgc}_n(x, y) = \max_{(k,l)} R(c^{kl}(x_n, y_n)). \tag{2.9}$$

Kmerf. The Kmerf test statistic is a kernel method for testing independence that uses a random forest induced similarity matrix as input, and has been shown to have especially high gains in finite sample testing power in high dimensional settings [29]. It is computed using the following algorithm:

1. Run random forest with m trees. Independent bootstrap samples of size $n_b \leq n$ are drawn to build each tree; each tree structure within the forest is denoted as $\phi_w \in P$, $w \in \{1, \ldots, m\}$; $\phi_w(x_i)$ denotes the partition assigned to $x_i$.
2. Calculate the proximity kernel:

$$K^x_{ij} = \frac{1}{m} \sum_{w=1}^{m} I(\phi_w(x_i) = \phi_w(x_j)),$$

   where $I(\cdot)$ is the indicator function, so that $K^x_{ij}$ measures how often two observations lie in the same partition.
3. Compute the induced kernel correlation: let

$$L^x_{ij} = \begin{cases} K^x_{ij} - \dfrac{1}{n-2} \sum_{t=1}^{n} K^x_{it} - \dfrac{1}{n-2} \sum_{s=1}^{n} K^x_{sj} + \dfrac{1}{(n-1)(n-2)} \sum_{s,t=1}^{n} K^x_{st} & \text{when } i \neq j, \\ 0 & \text{otherwise}. \end{cases}$$

   Then let $K^y$ be the Euclidean distance induced kernel, and similarly compute $L^y$ from $K^y$. The unbiased kernel correlation equals

$$\mathrm{Kmerf}_n(x, y) = \frac{1}{n(n - 3)} \mathrm{tr}(L^x L^y). \tag{2.10}$$

2.5 Permutation Tests

For many early independence tests, such as Pearson's, analytical p-values are available. When such analytic approximations are unknown, permutation tests permute one of the input data matrices x or y and calculate the test statistic for each permutation. Doing so many times approximates the null distribution, against which the observed test statistic can be compared to generate a p-value [43, 44]. In the case of nonparametric tests, permutations can be used to calculate the p-value exactly, since the calculation does not depend on a reference distribution [32]. However, with large amounts of data, calculating every permutation is computationally impractical. A finite number of random permutations typically approximates the true null distribution quite well with minimal additional computational cost [31, 32]. All tests used in Section 3 use this permutation method to approximate a p-value.

2.6 Implementation Details

All independence tests in this manuscript, aside from Manova, were implemented in the hyppo package [25]. Both Dcorr and Hsic are implemented using their unbiased versions. As mentioned previously, Dcorr uses the L2 norm as the default distance metric, and Hsic uses a Gaussian kernel as the default kernel, with the median of the pairwise differences of the data as the bandwidth. The statsmodels PyMANOVA and base R's Manova were used as references.

3 Results

3.1 K-sample Tests as Independence Tests

k-sample tests can be implemented as independence tests as follows: consider $u_1, \ldots, u_k$ as matrices of size $n_1 \times p, \ldots, n_k \times p$, where p refers to the number of dimensions and $n_i$ refers to the number of samples of $u_i$. Letting $n = \sum_{i=1}^{k} n_i$, define new data matrices x and y such that,

$$x = \begin{bmatrix} u_1 \\ \vdots \\ u_k \end{bmatrix} \in \mathbb{R}^{n \times p}, \qquad y = \begin{bmatrix} 1_{n_1 \times 1} & 0_{n_1 \times 1} & \cdots & 0_{n_1 \times 1} \\ 0_{n_2 \times 1} & 1_{n_2 \times 1} & \cdots & 0_{n_2 \times 1} \\ \vdots & \vdots & \ddots & \vdots \\ 0_{n_k \times 1} & 0_{n_k \times 1} & \cdots & 1_{n_k \times 1} \end{bmatrix} \in \mathbb{R}^{n \times k}.$$

Additionally, in the two-sample case,

$$x = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \in \mathbb{R}^{n \times p}, \qquad y = \begin{bmatrix} 0_{n_1 \times 1} \\ 1_{n_2 \times 1} \end{bmatrix} \in \mathbb{R}^{n}.$$

That is, x can be thought of as the data matrix (it contains all the concatenated data) while y can be thought of as the label matrix (it labels each row of x by the original input it came from). Therefore, x and y are now paired data matrices, and dependence between x and y indicates that the labels are informative; in other words, that the samples $u_1, \ldots, u_k$ have been drawn from different distributions. This idea is similar to the design matrix proposed in chapter 6 of Bickel and Doksum [45].

In fact, the process of performing a one-way k-sample test can be extended to the multiway case (when samples have multiple labels). The formulation for x is the same as specified above, but y is formulated as follows: given the samples as defined above, to perform a w-way test where w < k,

$$y = \begin{bmatrix} 1_{n_1 \times 1} & 0_{n_1 \times 1} & \cdots & 1_{n_1 \times 1} \\ 1_{n_2 \times 1} & 1_{n_2 \times 1} & \cdots & 0_{n_2 \times 1} \\ \vdots & \vdots & \ddots & \vdots \\ 0_{n_k \times 1} & 1_{n_k \times 1} & \cdots & 1_{n_k \times 1} \end{bmatrix} \in \mathbb{R}^{n \times k},$$

where each row of y contains w ones. This leads to label matrix distances proportional to how many labels (ways) samples differ by, a hierarchy of distances between samples thought to be true if the null hypothesis is rejected.

Performing a multilevel test involves constructing x and y using either of the methods above and then performing a block permutation [46]. Essentially, the permutation is striated: permutation is limited to be within a block of samples or between blocks of samples, but not both. This is done because the data is not freely exchangeable, so it is necessary to block the permutation to preserve the joint distribution [46].

3.2 Theorems

In this section, we present the theoretical results. The proofs and certain technical details (like the mathematical formulation of Energy and Mmd) are in the appendix. We first show that a consistent independence test is also consistent for the k-sample test:

Theorem 1. Let $Y \in \mathbb{R}^k$ be the 1-trial multinomial distribution of probability $(\pi_1, \pi_2, \ldots, \pi_k)$ where $\pi_l = \frac{n_l}{n}$, and X be the following mixture:

$$X \overset{d}{=} \sum_{l=1}^{k} U_l I(Y(l) \neq 0),$$

where $U_l$ is the underlying random variable for each $u_l$ and $I(\cdot)$ is the indicator function. Then X is independent of Y if and only if $U_1 \overset{d}{=} U_2 \overset{d}{=} \cdots \overset{d}{=} U_k$. Namely, any method that is universally consistent for testing independence between x and y is also universally consistent for a k-sample test.

In particular, distance covariance is actually equivalent to the two-sample energy statistic proposed in Székely and Rizzo [3]:

Theorem 2. Assume both distance covariance and the energy statistic use the same translation-invariant metric $d(\cdot, \cdot)$. Denoting $\beta = d(0, 1) - d(0, 0)$, it follows that

$$\mathrm{Dcov}_n(x, y) = \frac{2 n_1^2 n_2^2 \beta}{n^4} \cdot \mathrm{Energy}_{n_1, n_2}(u_1, u_2).$$

Under the permutation test, distance covariance, distance correlation, and the energy statistic have the same testing p-value.

The equivalence can be established between the maximum mean discrepancy proposed in [4] and the Hilbert-Schmidt independence criterion as well:

Theorem 3. Assume both Hilbert-Schmidt independence criterion and maximum mean discrepancy use the same translation-invariant kernel $k(\cdot, \cdot)$. Denoting $\beta = k(0, 1) - k(0, 0)$, it follows that

$$\mathrm{Hsic}_n(x, y) = \frac{2 n_1^2 n_2^2 \beta}{n^4} \cdot \mathrm{Mmd}_{n_1, n_2}(u_1, u_2).$$

Under the permutation test, Hilbert-Schmidt independence criterion and maximum mean discrepancy have the same testing p-value.

Lastly, we establish a relationship between distance covariance and k-sample Energy, proposed in Rizzo et al. [13], which consist of the same number of pairwise energy components but weight them differently.

Theorem 4. Assume both distance covariance and the k-sample energy statistic use the same translation-invariant metric $d(\cdot, \cdot)$. Denoting $\beta = d(y(s, :), y(t, :)) - d(y(s, :), y(s, :))$ for some $s \neq t$, it follows that

$$\mathrm{Dcov}_n(x, y) = \beta \sum_{1 \leq s < t \leq k} \frac{n(n_s + n_t) - \sum_{l=1}^{k} n_l^2}{n^4} \, n_s n_t \cdot \mathrm{Energy}_{n_s, n_t}(u_s, u_t).$$

Distance covariance and the k-sample energy statistic become the same if and only if either k = 2, or $n_1 = n_2 = \cdots = n_k$, in which case

$$\mathrm{Dcov}_n(x, y) = \frac{2\beta}{nk} \mathrm{Energy}_{n_1, \ldots, n_k}(\{u_k\}).$$
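The construction of Section 3.1 and the identity in Theorem 2 can be checked numerically. The sketch below is illustrative only and rests on several assumptions: Euclidean distances, the biased Dcov of Eq. (2.5), a one-hot label matrix (so that β is the distance between two distinct label rows, as in Theorem 4 with k = 2), and the energy statistic taken in its unnormalized form, 2 E d(U, V) − E d(U, U') − E d(V, V'), with sample means over all pairs. The function names are hypothetical, and the permutation p-value is a plain Monte Carlo estimate rather than hyppo's routine.

```python
import numpy as np


def dist(a, b=None):
    """Euclidean distances between the rows of a and the rows of b (default: a)."""
    a = np.asarray(a, dtype=float)
    b = a if b is None else np.asarray(b, dtype=float)
    return np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))


def k_sample_transform(samples):
    """Stack the samples into x and build the one-hot label matrix y (Section 3.1)."""
    x = np.vstack(samples)
    labels = np.repeat(np.arange(len(samples)), [len(u) for u in samples])
    y = np.eye(len(samples))[labels]
    return x, y


def dcov_biased(x, y):
    """Biased distance covariance, Eq. (2.5)."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(H @ dist(x) @ H @ H @ dist(y) @ H) / n ** 2


def energy(u, v):
    """Unnormalized two-sample energy statistic (assumed form, see lead-in)."""
    return 2 * dist(u, v).mean() - dist(u).mean() - dist(v).mean()


def permutation_pvalue(x, y, stat, reps=200, seed=0):
    """Monte Carlo permutation p-value obtained by shuffling the rows of y."""
    rng = np.random.default_rng(seed)
    observed = stat(x, y)
    null = [stat(x, y[rng.permutation(len(y))]) for _ in range(reps)]
    return (1 + sum(s >= observed for s in null)) / (1 + reps)


rng = np.random.default_rng(0)
u1 = rng.normal(size=(30, 2))
u2 = rng.normal(loc=1.5, size=(20, 2))
n1, n2 = len(u1), len(u2)
n = n1 + n2

x, y = k_sample_transform([u1, u2])
beta = np.linalg.norm(y[0] - y[-1])  # distance between two distinct label rows

lhs = dcov_biased(x, y)
rhs = 2 * n1 ** 2 * n2 ** 2 * beta / n ** 4 * energy(u1, u2)
pvalue = permutation_pvalue(x, y, dcov_biased)
print(lhs, rhs, pvalue)
```

Because Dcov is a fixed positive multiple of Energy once the group sizes are fixed, permuting the labels changes both statistics by the same factor, which is why the two tests share a permutation p-value.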

3.3 Gaussian Simulations

Consider the simplest possible three-sample tests, where in each case, all three samples are Gaussian with identity covariance matrix (I):

1. None Different: All three groups are Gaussian with the same mean: $\mu = (0, 0)$.
2. One Different: Two of the Gaussians have the same mean, while the third has a different mean; thus, $\mu = (0, 0)$ for two of the Gaussians and $\mu = (0, \epsilon)$ for the third Gaussian.
3. All Different: The three means form an equilateral triangle with center (0, 0) and side length $\epsilon$; thus, $\mu_1 = (0, \sqrt{3}/3 \times \epsilon)$, $\mu_2 = (-\epsilon/2, -\sqrt{3}/6 \times \epsilon)$, and $\mu_3 = (\epsilon/2, -\sqrt{3}/6 \times \epsilon)$.

Figure 1 shows (top) scatter plots and (bottom) statistical power for each of the three cases, for two-dimensional Gaussians where $\epsilon$ is increased from 0 to 1. None Different demonstrates that each test controls type I error properly. Since there is no difference in distribution, all tests are expected to have power no greater than $\alpha$ (0.05 in this case). One Different and All Different show that as the distributions separate from each other, all tests (k-sample Dcorr, Hsic, Mgc, and Manova) perform nearly the same.

[Figure 1: Power vs. increasing cluster separation. Columns: None Different, One Different, All Different. Top row: scatter plots of Cluster 1, Cluster 2, and Cluster 3. Bottom row: power (0 to 1) versus cluster separation (0 to 1) for Dcorr, MGC, Hsic, and Manova.]

Figure 1: Power versus epsilon curves for each of three different parametric settings. Three two-dimensional Gaussians were generated for each case with 100 samples each (see Section 3.3 for details). The top row shows a scatter plot of each simulation for a given cluster separation, and the bottom row shows the power curves for each simulation as cluster separation increases (averaged over 5 repetitions). All methods are valid because power is ≤ α under the null (column 1). There is no discernible difference in power between k-sample Dcorr, Hsic, Mgc, and Manova, even though these are all low dimensional Gaussian settings, which should be ideal for Manova.

In Figure 2, we consider the same settings except that we fix $\epsilon = 0.5$ and vary the dimension of each Gaussian from 2 to 100. To vary the dimension of each Gaussian, a two-dimensional Gaussian at the fixed $\epsilon$ was simulated and additional uninformative features drawn from a standard normal were added. As dimension increases, statistical power is expected to decrease. Once again, we verify that each test controls type I error properly in None Different. In the One Different and All Different cases, k-sample Mgc, Dcorr, and Hsic outperform Manova, and the gap in power between these three tests and Manova grows as dimension increases.

Figure 3 investigates the nature of the multiway effect and what settings are ideal for performing multiway tests. Pairwise distance matrices of the label matrices (row A) are shown, where each block is $n_i \times n_i$ and $n_i$ is the number of samples in each cluster (100 for these simulations). To investigate the performance of multiway tests, we simulated three two-dimensional Gaussians with means forming a triangle: one at the origin and the other two a fixed distance (c) away from the origin and separated by a variable distance ($\epsilon$). Analytically, the means were $\mu_1 = (0, 0)$, $\mu_2 = (-\epsilon/2, -\sqrt{c^2 - (\epsilon/2)^2})$, and $\mu_3 = (\epsilon/2, -\sqrt{c^2 - (\epsilon/2)^2})$, and the covariance matrix of each was the identity matrix. When the two variable Gaussians are more different from each other than from the Gaussian at the origin (i.e. $\epsilon > c$), the assumed multiway label matrix hierarchy in Figure 3 reflects the true hierarchy. When the two variable Gaussians are more similar to each other (i.e. $\epsilon < c$), the assumed hierarchy is incorrect.

In Figure 4, we demonstrate that non-multilevel Dcorr is invalid in the multilevel setting. We sampled 100 means from each of two Gaussians with covariance equal to the identity matrix ($N(0, 1)$ and $N(\epsilon, 1)$) a fixed distance apart. For each mean $\mu$, two samples were drawn from a Gaussian with variance 0.1 centered at that mean ($N(\mu, 0.1)$). Under the null, where cluster separation is 0, regular Dcorr has power greater than the $\alpha$-level and so is an invalid test. Multilevel Dcorr, however, is valid (power is equal to $\alpha$) as its permutations are restricted to reflect proper exchangeability of samples under the null [46]. So, while non-multilevel Dcorr achieves higher power than its multilevel variant, this is simply an artifact of non-multilevel Dcorr being invalid in this setting.

In the multiway simulation, we fixed c = 0.3 and varied the separation ($\epsilon$) from 0 to 0.6. Thus, we expected a decrease in power when the multiway assumption was false (i.e. $\epsilon = 0.2 < c$ as shown in row B, left), and we expected an increase in power when the multiway assumption was true (i.e. $\epsilon = 0.5 > c$ as shown in row B, right). This is because, under the alternative, the multiway test statistic is larger or smaller than the one-way statistic depending on whether the multiway assumption is true or false, respectively. Row C shows the power comparisons when increasing separation (left) and increasing dimension (right) in this simulation setting. When fixing the dimension at two and increasing separation, multiway Dcorr power is low when the cluster separation is less than c, and appears to be at or above the power of one-way k-sample Mgc, Hsic, Dcorr, and Manova when the separation is greater than c, as expected. When we fix $\epsilon = 0.5 > c$ and increase dimension, multiway Dcorr dominates all other tests at all dimensions.
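To make the multiway label matrix concrete, the following sketch builds one under an assumption that goes beyond the text: each way (factor) is one-hot encoded and the encodings are concatenated, so that every row contains w ones. The function name `multiway_labels` and the two-way design (two traits crossed with three states) are hypothetical. Under this encoding, the squared Euclidean distance between two label rows equals twice the number of ways in which the samples differ.

```python
import numpy as np


def multiway_labels(factors):
    """Concatenate one-hot encodings of each column (way) of an (n, w) factor array."""
    factors = np.asarray(factors)
    blocks = []
    for column in factors.T:
        levels, codes = np.unique(column, return_inverse=True)
        blocks.append(np.eye(len(levels))[codes.ravel()])
    return np.hstack(blocks)


# Hypothetical two-way design: column 0 is a trait (2 levels),
# column 1 is a state (3 levels); one row per sample.
factors = np.array([[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2]])
y = multiway_labels(factors)

# Squared Euclidean distances between label rows.
sq_dist = ((y[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
# Number of ways in which each pair of samples differs.
n_differ = (factors[:, None, :] != factors[None, :, :]).sum(axis=-1)

print(y)
print(sq_dist)
```

A one-way test would instead treat the six trait-by-state combinations as six unrelated classes, so every pair of distinct groups would be equally far apart in label space.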

[Figure 2: Power vs. increasing Gaussian dimension. Columns: None Different, One Different, All Different. Top row: scatter plots of Cluster 1, Cluster 2, and Cluster 3. Bottom row: power (0 to 1) versus dimension (2 to 350) for MGC, Dcorr, Hsic, and Manova.]

Figure 2: Power versus dimension curves for each of three different parametric settings for a fixed sample size (100 samples per cluster) and epsilon (0.5). As dimension increases, Mgc, Dcorr, and Hsic all perform better than Manova (columns 2 and 3), especially as the number of dimensions approaches the sample size. Manova cannot function once the dimension exceeds the rank, which is at most equal to the total sample size (300 in this simulation, denoted by the vertical dashed line).
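The Gaussian settings of Section 3.3 are simple to generate. The sketch below is an illustrative reconstruction, not the authors' simulation code: it builds the All Different means, draws identity-covariance Gaussians around them, and appends uninformative standard normal features to raise the dimension. The function names and the seed handling are assumptions.

```python
import numpy as np


def all_different_means(eps):
    """Means of the All Different setting: an equilateral triangle centered at the origin."""
    return np.array([
        [0.0, np.sqrt(3) / 3 * eps],
        [-eps / 2, -np.sqrt(3) / 6 * eps],
        [eps / 2, -np.sqrt(3) / 6 * eps],
    ])


def simulate_gaussians(means, n=100, p=2, seed=0):
    """Draw n points per cluster; only the first two of the p dimensions carry signal."""
    rng = np.random.default_rng(seed)
    samples = []
    for mu in means:
        signal = rng.normal(loc=mu, size=(n, 2))   # identity covariance
        noise = rng.normal(size=(n, p - 2))        # uninformative features
        samples.append(np.hstack([signal, noise]))
    return samples


eps = 0.5
means = all_different_means(eps)
samples = simulate_gaussians(means, n=100, p=10)

# Pairwise distances between the three means.
side_lengths = [np.linalg.norm(means[i] - means[j]) for i, j in [(0, 1), (0, 2), (1, 2)]]
print(side_lengths, [s.shape for s in samples])
```

The three samples can then be passed to any k-sample test, for example after stacking them into a data matrix and a label matrix as described in Section 3.1.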

These results suggest that even in a simulation setting where Manova is expected to perform best (linear simulation setting, all distributions Gaussian, all distributions with the same covariance), nonparametric k-sample tests can perform as well, and sometimes much better (such as in high-dimensional multiway testing).

3.4 A Benchmark Suite of 20 Rotated Simulations for K-Sample Testing

We consider a benchmark suite of 20 different distributions as developed previously for independence testing, including polynomial (linear, quadratic, cubic), trigonometric (sinusoidal, circular, ellipsoidal, spiral), geometric (square, diamond, W-shaped), and other relationships [5, 17, 18, 25, 28, 47, 48]. In each case, we sample n times from one of these 20 distributions, then apply a counter-clockwise rotation to the distribution to generate a second sample, and then apply a clockwise rotation to the distribution to form a third sample (so, in the following, n = m for all simulations). In each case, the noise distribution is determined as described in Vogelstein et al. [28]. Appendix C overviews the rotation process used to generate the three samples.

The following three figures show power curves for each of the 20 settings. The bottom right panel illustrates the power under the null, which must be less than or equal to $\alpha$ for a test to be valid. Figure 5 evaluates the tests for varying sample size in three-sample tests where x, y, and z are two-dimensional, $F_y$ is rotated 90 degrees counter-clockwise relative to $F_x$, and $F_z$ is rotated 90 degrees clockwise relative to $F_x$. The y-axis shows the power of each test relative to Manova's power, meaning that if a test achieves higher power than Manova its curve is above the red line, and otherwise its curve is below the red line. In this setting, k-sample Mgc and k-sample Kmerf perform as well as or better than all other k-sample tests in all simulation settings and sample sizes while properly controlling Type I error. Manova performs similarly to k-sample Cca and k-sample RV. Note that all of the nonparametric k-sample tests we benchmarked outperform Manova, even in the linear and Gaussian settings.

Figure 6 shows the same 20 settings, except the sample size is fixed at n = 100 and the rotation angle for $F_y$ and $F_z$ is varied from 0° to 90°. As with Figure 5, power was plotted relative to Manova. In this setting, for nearly all angles and simulation settings, k-sample Mgc and k-sample Kmerf achieved the same or higher power when compared to every other test. Here, Manova outperforms k-sample Mgc in one simulation setting (exponential). Visual inspection indicates that this setting is approximately Gaussian, where we previously demonstrated Manova can achieve higher power than k-sample Mgc and Kmerf for certain parameter settings and sample size combinations.

Figure 7 shows the power for these tests as the number of dimensions increases, while the sample size and angle are fixed at 100 and 90 degrees, respectively. In this setting, k-sample Mgc and Kmerf again outperform every other test in nearly all settings. In fact, for the majority of the simulations, k-sample Kmerf outperforms all other tests, demonstrating its power in high dimensional settings. Manova performed better in the cubic and Bernoulli simulations; visual inspection indicates that these two settings are approximately Gaussian.

[Figure 3: Panels (A) Default distances (top) and Multiway distances (bottom); (B) Weak multiway, ε = 0.2 (top) and Strong multiway, ε = 0.5 (bottom), showing Cluster 1, Cluster 2, and Cluster 3; (C) Cluster separation: power (0 to 1) versus separation ε (0.0 to 0.6) (top) and Added noise dimensions: power (0 to 1) versus dimension (2 to 100) (bottom) for MGC, Hsic, Dcorr, Manova, and Multiway Dcorr.]

Figure 3: Comparisons of different simulation settings for multiway and one-way tests. (A) Multiway tests modify the usual one-hot encoding of the label matrices (see Section 3.3 for details). (B) Three two-dimensional Gaussians whose means form an isosceles triangle, with ε = 0.2 and ε = 0.5 used for visualization purposes. Weak multiway effects are expected when ε < c (top) and strong multiway effects are expected when ε > c (bottom). (C) At a fixed sample size (100 samples) and dimension (2), multiway Dcorr performs worse than the other benchmarked tests when ε < 0.3 and performs at or above all other tests when ε > 0.3, as expected (top). When ε = 0.5, sample size is fixed (100 samples), and the dimension of the Gaussians is increased, multiway Dcorr dominates all other tests (bottom).

[Figure 4: Multilevel Dcorr: Power vs. cluster separation. Columns: Null, Alternative. Top row: scatter plots of Cluster 1 and Cluster 2. Bottom row: power (0 to 1) versus cluster separation (0 to 1) for Dcorr and Multilevel Dcorr.]

Figure 4: Power versus epsilon curves for regular and multilevel Dcorr in a multilevel setting. 100 means were sampled from each of two two-dimensional Gaussians. Two samples were generated from Gaussians centered at each mean and with lower variance (400 samples total). The top row shows a scatter plot of each simulation for a given cluster separation, and the bottom row shows the power curves for each simulation as cluster separation increases (averaged over 5 repetitions). In the multilevel setting, only multilevel Dcorr is valid, as its power is ≤ α (marked by the dashed grey line) under the null (column 1). Regular Dcorr is invalid, and its apparent greater power under the alternative (column 2) is an artifact of its invalidity.
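The restricted permutation used by a multilevel test can be sketched as follows. This is a hypothetical illustration of a between-block permutation, in which whole blocks (for example, the two samples drawn around the same mean in Figure 4) are exchanged as units. It is not taken from the hyppo source, and it assumes all blocks have the same size; the function name `block_permutation` is an assumption.

```python
import numpy as np


def block_permutation(blocks, rng):
    """Return an index permutation that exchanges whole blocks of equal size.

    blocks[i] is the block identifier of sample i. The members of a block are
    always moved together, so any label that is constant within a block stays
    constant within a block after permuting.
    """
    blocks = np.asarray(blocks)
    ids = np.unique(blocks)
    members = [np.flatnonzero(blocks == b) for b in ids]
    if len({len(m) for m in members}) != 1:
        raise ValueError("this sketch assumes blocks of equal size")
    order = rng.permutation(len(ids))
    perm = np.empty(len(blocks), dtype=int)
    for destination, source in zip(members, (members[i] for i in order)):
        perm[destination] = source  # the destination block receives the source block
    return perm


rng = np.random.default_rng(0)
blocks = np.repeat(np.arange(5), 2)          # 5 blocks with 2 samples each
labels = np.repeat([0, 0, 1, 1, 1], 2)       # group label, constant within a block
perm = block_permutation(blocks, rng)
permuted_labels = labels[perm]
print(perm, permuted_labels)
```

In a permutation test, `permuted_labels` (or the permuted rows of the label matrix y) would replace the original labels when recomputing the test statistic, so that samples from the same block never end up with different group labels.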

4 Real data: fMRI measurements Multiway and multilevel effects were investigated with Dcorr, which was used to test for differences in functional magnetic resonance imaging (fMRI) data. Specifically, the data consist of 75 subjects: 28 experienced meditators (over ten thousand hours of practice each) and 47 novice meditators (one weekend of training). Each individual took part in three recording sessions: one at resting state, one during an open monitoring meditative state, and one during a compassion meditative state. The fMRI data were processed using the standard fmriprep [49] pipeline and projected onto the fsaverage5 [50] surface meshes. Each of the 225 subject scans was thus a time series of between 200 and 300 timesteps across 18715 cortical vertices.

We computed low-dimensional embeddings of the subject scans (the fMRI community calls these embeddings 'gradients') and tested for differences between the states (recording task) and/or traits (expert and novice). These embeddings were calculated using generalized canonical correlation analysis (GCCA) [51][52], functionally similar to group PCA [53], from the mvlearn Python package [54]. GCCA solved for embeddings $\{v_i\}_{i=1,\dots,225} \in \mathbb{R}^{18715}$ for each scan $i$ across the cortical vertices whose sum of pairwise correlations was maximized, effectively aligning the embeddings, which would otherwise be undetermined up to a rotation. This involved a two-step singular value decomposition procedure to calculate loading vectors in the temporal domain. The top three gradients for each subject were calculated, each orthogonal to the previous ones, and all of equal Euclidean norm.

Testing the top three low-dimensional embeddings posed a multisample, multilevel situation. For each of these embeddings, and for a combination of them concatenated into a single vector, a six-sample multiway test was performed to identify the presence of any differences. Two subsequent three-sample tests were performed within the novices and experts, separately, as well as two-sample tests of interest. Because of repeat measurements from the same subject in some of the tests, a strong within-subject effect dominated, and so a multilevel permutation correction was applied to yield a valid test at

[Figure 5 plot: "Multivariate Three-Sample Testing, Increasing Sample Size". Panels: Linear, Exponential, Cubic, Joint Normal, Step, Quadratic, W-Shaped, Spiral, Bernoulli, Logarithmic, Fourth Root, Sine 4, Sine 16, Square, Two Parabolas, Circle, Ellipse, Diamond, Multiplicative, Independence. x-axis: Sample Size (5 to 100). y-axis: Statistical Power Relative to Manova. Legend: KMERF, MGC, Dcorr, Hsic, Manova, HHG, CCA, RV.]

Figure 5: Power versus sample size curves for each of 20 three-sample simulations for a fixed angle (90 degrees), where all inputs are two-dimensional (averaged over 5 repetitions). Note that the noise applied to the circle simulation is not isotropic. Power curves are plotted relative to Manova: those above 0 outperform Manova and those below 0 perform worse than Manova. k-sample Mgc and k-sample Kmerf empirically dominate all other tests, meaning they achieve as high or higher statistical power for all simulations and sample sizes.

the α = 0.05 level. Because multiple hypothesis tests were conducted, in each k-sample stage we applied the conservative Bonferroni correction [55], adjusting our p-values by the number of tests run up to and including that stage. As shown in Figure 8, we find significant differences between all subjects during resting state and compassion, between all subjects during compassion and open monitoring, and between novices in compassion and open monitoring. In the case of all subjects during rest versus all subjects during compassion, the third gradient appears to be where the predominant differences lie. Additionally, omnibus tests in most cases lead to significant pairwise tests of interest (not shown).

5 Conclusion We have presented several k-sample tests using independence statistics. The transformation of input data presented is sufficiently general that it can be applied to any future independence tests that achieve higher statistical power than the existing ones. A simulation setting in which Manova is expected to perform best, and in which some of the nonparametric k-sample tests presented nonetheless performed a bit better than Manova, demonstrates the value of using this implementation

[Figure 6 plot: "Multivariate Three-Sample Testing, Increasing Angle". Panels: Linear, Exponential, Cubic, Joint Normal, Step, Quadratic, W-Shaped, Spiral, Bernoulli, Logarithmic, Fourth Root, Sine 4, Sine 16, Square, Two Parabolas, Circle, Ellipse, Diamond, Multiplicative, Independence. x-axis: Angle (0 to 90). y-axis: Statistical Power Relative to Manova. Legend: KMERF, MGC, Dcorr, Hsic, Manova, HHG, CCA, RV.]

Figure 6: Power versus angle for 20 three-sample tests with a fixed sample size (100 samples) in two dimensions (averaged over 5 repetitions). k-sample Mgc and Kmerf empirically dominate the other tests in nearly all of the simulation settings. Manova performs slightly better than k-sample Mgc for certain angles in the exponential simulation, probably because those settings closely approximate the setting Manova was designed for.

of nonparametric k-sample testing. Further, in the majority of circumstances across many dependency structures, k-sample Mgc in particular performed at the same level as or better than any other k-sample test we evaluated. We also investigated the extension of our k-sample testing procedure to multiway and multilevel tests. We found that when there is a suspected multiway effect in which one effect dominates the others, multiway tests perform at or above the level of state-of-the-art tests at low dimensions and dominate these tests as dimension increases. We also applied multiway and multilevel tests to real data too large for Manova to be run, where we were able to find significant differences in multiple embeddings between multiple states. It is also worth noting that this procedure does not add any computational complexity to the algorithms, so the expected speed of each algorithm depends on the computational complexity of the independence test being run. As a result, there are, in fact, many more k-sample tests than previously thought, and we can exploit the finite-sample testing power advantages of new independence tests to create even more powerful k-sample tests in the future.
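The transformation of input data discussed in this section can be illustrated in the two-sample case. The plain-Python sketch below pools two samples, attaches a 0/1 group label to each point, and compares the resulting biased sample distance covariance with the two-sample energy statistic. The scaling factor $2 n_1^2 n_2^2 / n^4$ is the one stated in the proof of Theorem 2 (Appendix B.2). The function names and the simulated Gaussian data are illustrative assumptions and are not taken from the hyppo package.

```python
import math
import random


def dist_matrix(points):
    """Pairwise Euclidean distance matrix."""
    return [[math.dist(p, q) for q in points] for p in points]


def double_center(d):
    """Return H D H for a symmetric matrix D, with H the centering matrix."""
    n = len(d)
    row = [sum(r) / n for r in d]
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]


def dcov(x, y):
    """Biased sample distance covariance: (1/n^2) * sum_ij Dx_ij * (H Dy H)_ij."""
    dx = dist_matrix(x)
    hdy = double_center(dist_matrix(y))
    n = len(x)
    return sum(dx[i][j] * hdy[i][j] for i in range(n) for j in range(n)) / n**2


def energy(u, v):
    """Two-sample energy statistic with Euclidean distance."""
    n1, n2 = len(u), len(v)
    between = sum(math.dist(a, b) for a in u for b in v)
    within_u = sum(math.dist(a, b) for a in u for b in u)
    within_v = sum(math.dist(a, b) for a in v for b in v)
    return 2 * between / (n1 * n2) - within_u / n1**2 - within_v / n2**2


def pool_with_labels(u, v):
    """Turn a two-sample problem into an independence problem (x, y)."""
    x = list(u) + list(v)
    y = [(0.0,)] * len(u) + [(1.0,)] * len(v)  # scalar 0/1 group label
    return x, y


rng = random.Random(0)
u = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(5)]  # hypothetical sample 1
v = [(rng.gauss(1, 1), rng.gauss(0, 1)) for _ in range(7)]  # hypothetical sample 2
x, y = pool_with_labels(u, v)
n1, n2, n = len(u), len(v), len(x)
lhs = dcov(x, y)
rhs = 2 * n1**2 * n2**2 / n**4 * energy(u, v)
```

Here `lhs` and `rhs` agree up to floating-point error. Because the scaling factor depends only on the sample sizes, permuting the labels leaves the ordering of the two statistics unchanged, which is why both give the same permutation p-value.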

[Figure 7 plot: "Multivariate Three-Sample Testing, Increasing Dimension". Panels: Linear, Exponential, Cubic, Joint Normal, Step, Quadratic, W-Shaped, Spiral, Bernoulli, Logarithmic, Fourth Root, Sine 4, Sine 16, Square, Two Parabolas, Circle, Ellipse, Diamond, Multiplicative, Independence. x-axis: Dimension (3 to 10). y-axis: Statistical Power Relative to Manova. Legend: KMERF, MGC, Dcorr, Hsic, Manova, HHG, CCA, RV.]

Figure 7: Power versus dimension for 20 three-sample simulations with fixed sample size (100 samples) and angle (90 degrees) in two dimensions (averaged over 5 repetitions). k-sample Mgc and Kmerf empirically dominate the other tests in nearly all of the simulation settings.

Data and Code Availability Statement The analysis and visualization of this data were done using the hyppo open-source package https://hyppo.neurodata.io/ and the mvlearn package https://mvlearn.github.io/. Source code, documentation, and tutorials can be found there.

Acknowledgements This work is graciously supported by the Defense Advanced Research Projects Agency (DARPA) Lifelong Learning Machines program through contract FA8650-18-2-7834, the National Institute of Health awards RO1MH120482 and T32GM119998, the National Science Foundation award DMS-1921310, and Microsoft Research. The authors would also like to acknowledge Dr. Russell Lyons, Dr. Minh Tang, Mr. Ronak Mehta, Mr. Eric W. Bridgeford, and the rest of the NeuroData family at Johns Hopkins University for helpful feedback throughout the development process.

[Figure 8 heatmap. Columns: gradient(s) 1; 2; 3; 1,2; 2,3; 1,3; 1,2,3. Color: p-value (log scale, Bonferroni-adjusted; color bar ticks at 1, 0.1, 0.05, 0.01). A white X in a cell marks a significant test. Each row is one test, described by the columns below.]

K | Multilevel | Multiway | Samples
6 | X | X | All states, traits
3 | X |   | EXP states
3 | X |   | NOV states
2 |   |   | EXP all, NOV all
2 | X |   | EXP rest, EXP comp
2 | X |   | EXP rest, EXP open
2 | X |   | EXP open, EXP comp
2 | X |   | EXP rest, EXP med
2 | X |   | NOV rest, NOV comp
2 | X |   | NOV rest, NOV open
2 | X |   | NOV open, NOV comp
2 | X |   | NOV rest, NOV med
2 |   |   | EXP rest, NOV rest
2 |   |   | EXP comp, NOV comp
2 |   |   | EXP open, NOV open
2 |   |   | EXP med, NOV med
2 |   |   | EXP rest, NOV comp
2 |   |   | EXP rest, NOV open
2 |   |   | EXP comp, NOV rest
2 |   |   | EXP comp, NOV open
2 |   |   | EXP open, NOV rest
2 |   |   | EXP open, NOV comp
2 | X |   | ALL rest, ALL comp
2 | X |   | ALL rest, ALL open
2 | X |   | ALL comp, ALL open
2 | X |   | ALL rest, ALL med

Figure 8: Omnibus and pairwise tests on combinations of states (open, rest, comp) and traits (NOV, EXP) reveal significant corrected p-values (denoted with a white X) at the α = 0.05 level. The 6-sample test shows significant effects from the third embedding and combinations of it. Subsequent 3-sample and 2-sample tests reveal additional significant differences, primarily in the third embedding within novice meditators and between the compassion meditative state.
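The Bonferroni adjustment behind the corrected p-values in this figure amounts to multiplying each raw p-value by the number of tests and capping the result at 1. A minimal sketch follows; the raw p-values and the count of 7 tests are hypothetical placeholders chosen only for illustration, and `bonferroni_adjust` is an illustrative helper, not a function from hyppo.

```python
def bonferroni_adjust(p_values, n_tests=None):
    """Multiply each p-value by the number of tests and cap at 1."""
    m = len(p_values) if n_tests is None else n_tests
    return [min(1.0, p * m) for p in p_values]


raw = [0.001, 0.02, 0.2]  # hypothetical raw p-values
# n_tests is the cumulative number of tests run up to and including the current stage.
adjusted = bonferroni_adjust(raw, n_tests=7)
significant = [p <= 0.05 for p in adjusted]
```

Passing the cumulative count as `n_tests` mirrors the staged procedure described in the text, where later stages are penalized for all tests run before them.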

APPENDIX

Appendix A. Supplementary Information. The analysis and visualization of this data were done using the hyppo open-source package https://hyppo.neurodata.io/ and the mvlearn package https://mvlearn.github.io/. Source code, documentation, and tutorials can be found there. Figure replication code for this manuscript can be found here: https://github.com/neurodata/hyppo-papers/tree/main/ksample.

Appendix B. Proofs.

B.1 Theorem 1

Proof. As $X \mid Y(s) \neq 0 \stackrel{d}{=} U_s$, we have $U_1 \stackrel{d}{=} U_2 \stackrel{d}{=} \cdots \stackrel{d}{=} U_k$ if and only if the conditional distribution $X \mid Y(s)$ does not change with $Y(s)$, if and only if $X$ is independent of $Y$. Therefore, any consistent independence statistic can be used for consistent k-sample testing.

B.2 Theorem 2

Proof. Without loss of generality, assume we use the Euclidean distance so that $\beta = d(0, 1) - d(0, 0) = 1$. From [3], the two-sample energy statistic equals

$$\mathrm{Energy}_{n_1,n_2}(u, v) = \frac{1}{n_1^2 n_2^2} \left( 2 n_1 n_2 \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} d(u_i, v_j) - n_2^2 \sum_{i,j=1}^{n_1} d(u_i, u_j) - n_1^2 \sum_{i,j=1}^{n_2} d(v_i, v_j) \right).$$

Then the sample distance covariance equals

$$\mathrm{Dcov}_n(x, y) = \frac{1}{n^2} \operatorname{tr}(H D^x H H D^y H) = \frac{1}{n^2} \operatorname{tr}(D^x H D^y H) = \frac{1}{n^2} \sum_{i,j=1}^{n} D^x_{ij} \cdot (H D^y H)_{ij}$$

by the property of the matrix trace and the idempotent property of the centering matrix $H$. The two distance matrices satisfy

$$D^x_{ij} = d(x_i, x_j) = \begin{cases} d(u_i, u_j) & \text{if } 1 \le i, j \le n_1, \\ d(v_i, v_j) & \text{if } n_1 < i, j \le n, \\ d(u_i, v_j) & \text{otherwise,} \end{cases}$$

$$D^y_{ij} = d(y_i, y_j) = \begin{cases} 0 & \text{if } 1 \le i, j \le n_1 \text{ or } n_1 < i, j \le n, \\ 1 & \text{otherwise.} \end{cases}$$

It follows that

$$(H D^y H)_{ij} = \begin{cases} \dfrac{-2 n_2^2}{n^2} & \text{if } 1 \le i, j \le n_1, \\ \dfrac{-2 n_1^2}{n^2} & \text{if } n_1 < i, j \le n, \\ \dfrac{2 n_1 n_2}{n^2} & \text{otherwise.} \end{cases}$$

Therefore, up to a scaling factor, the centering scheme via distance covariance happens to match the weight of the energy statistic for each term. Expanding all terms leads to

$$\mathrm{Dcov}_n(x, y) = \frac{1}{n^4} \left( 4 n_1 n_2 \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} d(u_i, v_j) - 2 n_2^2 \sum_{i,j=1}^{n_1} d(u_i, u_j) - 2 n_1^2 \sum_{i,j=1}^{n_2} d(v_i, v_j) \right) = \frac{2 n_1^2 n_2^2}{n^4} \mathrm{Energy}_{n_1,n_2}(u, v).$$

As the scalar $\frac{2 n_1^2 n_2^2}{n^4}$ is invariant under any permutation of the given sample data, distance covariance and the energy statistic have the same p-value under the permutation test. To extend the equivalence to any translation-invariant metric beyond the Euclidean metric, one only needs to multiply the matrix $H D^y H$ and the above equations on the energy side by the scalar $\beta = d(0, 1) - d(0, 0)$, and everything else is the same.

B.3 Theorem 3

Proof. The equivalence between the Hilbert-Schmidt independence criterion and maximum mean discrepancy can be established via the exact same procedure. Assuming $d(\cdot, \cdot)$ is a translation-invariant kernel and the distance matrices are kernel matrices, $\mathrm{Energy}_{n_1,n_2}(u, v)$ becomes $-\mathrm{Mmd}_{n_1,n_2}(u, v)$, $\mathrm{Dcov}_n(x, y)$ becomes $-\mathrm{Hsic}_n(x, y)$, and every other step in the proof of Theorem 2 is the same.

B.4 Theorem 4

Proof. First, each pairwise energy statistic equals

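$$\mathrm{Energy}_{n_s,n_t}(u^s, u^t) = \frac{2}{n_s n_t} \sum_{i=1}^{n_s} \sum_{j=1}^{n_t} d(u^s_i, u^t_j) - \frac{1}{n_s^2} \sum_{i,j=1}^{n_s} d(u^s_i, u^s_j) - \frac{1}{n_t^2} \sum_{i,j=1}^{n_t} d(u^t_i, u^t_j).$$

Then for the distance covariance, the matrix $D^y$ equals

$$D^y_{ij} = \begin{cases} 0 & \text{for within-group entries,} \\ \beta & \text{for between-group entries.} \end{cases}$$

The whole matrix mean equals $\frac{1}{n^2} \sum_{i,j=1}^{n} D^y_{ij} = \beta \left( 1 - \frac{\sum_{t=1}^{k} n_t^2}{n^2} \right)$, and the mean of each matrix row is $\frac{1}{n} \sum_{t=1}^{n} D^y_{it} = \beta \left( 1 - \frac{n_s}{n} \right)$, assuming the $i$th point belongs to group $s$. As

$$(H D^y H)_{ij} = D^y_{ij} - \frac{1}{n} \sum_{t=1}^{n} D^y_{it} - \frac{1}{n} \sum_{t=1}^{n} D^y_{tj} + \frac{1}{n^2} \sum_{l,t=1}^{n} D^y_{lt},$$

the centered matrix equals

$$(H D^y H)_{ij} = \begin{cases} \beta \left( \dfrac{2 n n_s - \sum_{t=1}^{k} n_t^2}{n^2} - 1 \right) & \text{for entries within group } s, \\ \beta \, \dfrac{n (n_s + n_t) - \sum_{l=1}^{k} n_l^2}{n^2} & \text{for entries between groups } s \text{ and } t. \end{cases}$$

Next, we show that the within-group entries satisfy

$$\frac{2 n n_s - \sum_{t=1}^{k} n_t^2}{n^2} - 1 = - \sum_{t \neq s}^{1, \dots, k} \frac{n (n_s + n_t) - \sum_{l=1}^{k} n_l^2}{n^2} \cdot \frac{n_t}{n_s} \tag{B.1}$$

for each group $s$. Without loss of generality, assume $s = 1$ and multiply both sides by $n^2$. Thus, proving Equation B.1 is equivalent to proving

$$\begin{aligned} & 2 n n_1 - \sum_{t=1}^{k} n_t^2 - n^2 + \sum_{l=2}^{k} n (n_1 + n_l) \frac{n_l}{n_1} - \sum_{t=1}^{k} n_t^2 \sum_{l=2}^{k} \frac{n_l}{n_1} = 0 \\ \Leftrightarrow\ & 2 n n_1 - n^2 + \sum_{l=2}^{k} n n_l - \sum_{t=1}^{k} n_t^2 - \sum_{t=1}^{k} n_t^2 \sum_{l=2}^{k} \frac{n_l}{n_1} + n \sum_{l=2}^{k} \frac{n_l^2}{n_1} = 0 \\ \Leftrightarrow\ & 2 n n_1 - n^2 + n (n - n_1) - \sum_{t=1}^{k} n_t^2 - \frac{n - n_1}{n_1} \sum_{t=1}^{k} n_t^2 + \frac{n}{n_1} \left( \sum_{t=1}^{k} n_t^2 - n_1^2 \right) = 0 \\ \Leftrightarrow\ & n n_1 - \sum_{t=1}^{k} n_t^2 \left( 1 + \frac{n - n_1}{n_1} - \frac{n}{n_1} \right) - n n_1 = 0, \end{aligned}$$

which all cancel out at the last step. Therefore, Equation B.1 guarantees that the weight in each term of $\mathrm{Energy}_{n_s,n_t}(u^s, u^t)$ matches the corresponding weight in $H D^y H$, and it follows that

$$\begin{aligned} \mathrm{Dcov}_n(x, y) &= \frac{1}{n^2} \sum_{i,j=1}^{n} D^x_{ij} \cdot (H D^y H)_{ij} \\ &= \frac{\beta}{n^2} \sum_{1 \le s < t \le k} \frac{n (n_s + n_t) - \sum_{l=1}^{k} n_l^2}{n^2} \left( 2 \sum_{i=1}^{n_s} \sum_{j=1}^{n_t} d(u^s_i, u^t_j) - \frac{n_t}{n_s} \sum_{i,j=1}^{n_s} d(u^s_i, u^s_j) - \frac{n_s}{n_t} \sum_{i,j=1}^{n_t} d(u^t_i, u^t_j) \right), \end{aligned}$$

so that

$$\mathrm{Dcov}_n(x, y) = \beta \sum_{1 \le s < t \le k} \left\{ \frac{n (n_s + n_t) - \sum_{l=1}^{k} n_l^2}{n^4} \, n_s n_t \cdot \mathrm{Energy}_{n_s,n_t}(u^s, u^t) \right\}.$$

As the k-sample energy statistic equals

$$\mathrm{Energy}_{n_1,\dots,n_k}(\{u^k\}) = \sum_{1 \le s < t \le k} \left\{ \frac{n_s n_t}{2n} \mathrm{Energy}_{n_s,n_t}(u^s, u^t) \right\},$$

these two statistics can be equivalent up to scaling if and only if $n (n_s + n_t) - \sum_{l=1}^{k} n_l^2$ is a fixed constant for all possible $s \neq t$, or equivalently $n_s + n_t$ is fixed. This is true when either $k = 2$, or $n_1 = n_2 = \dots = n_k = \frac{n}{k}$ for $k \ge 3$, in which case

$$\mathrm{Dcov}_n(x, y) = \frac{\beta}{n^2 k} \sum_{1 \le s < t \le k} \left\{ n_s^2 \cdot \mathrm{Energy}_{n_s,n_t}(u^s, u^t) \right\}.$$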

Therefore, $\mathrm{Dcov}_n(x, y)$ is equivalent to the k-sample energy statistic when $k = 2$ or when every group has the same size.

Appendix C. Simulations. We perform three-sample testing between $Z$, $Z'$, and $Z''$ as follows: let $Z = [X \mid Y]$ be the respective random variables from a benchmark suite of 20 independence testing simulations [28, 29]. Then define $Q_\theta$ as a rotation matrix for a given angle $\theta$, i.e.,

$$Q_\theta = \begin{bmatrix} \cos\theta & 0 & \dots & -\sin\theta \\ 0 & 1 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \sin\theta & 0 & \dots & \cos\theta \end{bmatrix}.$$

Then we let

$$Z' = Q_\theta Z^T, \qquad Z'' = Q_{-\theta} Z^T$$

be the rotated versions of $Z$. Figure E1 shows the 20 simulations and their rotated variants used to produce the power curves in Figures 5, 6, and 7.
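The construction of the rotated samples can be sketched in a few lines of plain Python. The code below builds a matrix with the structure of $Q_\theta$ (a rotation acting on the first and last coordinates, identity elsewhere) and applies it to each point. The helper names and the three example points are illustrative assumptions, and the sketch assumes the dimension is at least two.

```python
import math


def rotation_matrix(theta_deg, p):
    """p x p matrix rotating the plane spanned by the first and last coordinates."""
    t = math.radians(theta_deg)
    q = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    q[0][0] = math.cos(t)
    q[0][p - 1] = -math.sin(t)
    q[p - 1][0] = math.sin(t)
    q[p - 1][p - 1] = math.cos(t)
    return q


def rotate(z, theta_deg):
    """Apply Q_theta to every point in z (a list of p-dimensional tuples)."""
    p = len(z[0])
    q = rotation_matrix(theta_deg, p)
    return [tuple(sum(q[i][j] * point[j] for j in range(p)) for i in range(p)) for point in z]


z = [(1.0, 0.0), (0.0, 2.0), (1.0, 1.0)]  # hypothetical first sample
z_prime = rotate(z, 90)                   # second sample: rotated by theta
z_double_prime = rotate(z, -90)           # third sample: rotated by -theta
```

Because $Q_\theta$ is orthogonal, each rotated point keeps its Euclidean norm, so the three samples differ only in orientation.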

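[Figure E1 plot. Panels: Linear, Exponential, Cubic, Joint Normal, Step, Quadratic, W-Shaped, Spiral, Bernoulli, Logarithmic, Fourth Root, Sine 4, Sine 16, Square, Two Parabolas, Circle, Ellipse, Diamond, Multiplicative, Independence. Legend: Sample 1, Sample 2, Sample 3.]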

Figure E1: Simulations for power curves. The first dataset (black dots) is 500 samples from each of the 20 two-dimensional, noisy simulation settings from the hyppo package. The two other datasets are the first dataset rotated by 60 degrees clockwise and 60 degrees counter-clockwise. Noise in these simulations is reduced for visualization purposes.

References.
[1] Student. The probable error of a mean. Biometrika, pages 1–25, 1908.
[2] . The generalization of student's ratio. In Breakthroughs in statistics, pages 54–65. Springer, 1992.
[3] Gábor J Székely and Maria L Rizzo. Energy statistics: A class of statistics based on distances. Journal of statistical planning and inference, 143(8):1249–1272, 2013.
[4] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
[5] Ruth Heller, Yair Heller, and Malka Gorfine. A consistent multivariate test of association based on ranks of distances. Biometrika, 100(2):503–510, 2012.
[6] Ronald A Fisher. Xv.—the correlation between relatives on the supposition of mendelian inheritance. Earth and Environmental Science Transactions of the Royal Society of Edinburgh, 52(2):399–433, 1919.
[7] Maurice S Bartlett. Multivariate analysis. Supplement to the journal of the royal statistical society, 9(2):176–197, 1947.
[8] Russell Warne. A primer on multivariate analysis of variance (manova) for behavioral scientists. Practical Assessment, Research, and , 19(1):17, 2014.
[9] JP Stevens. Applied multivariate statistics for the social sciences. lawrence erlbaum. Mahwah, NJ, pages 510–1, 2002.
[10] Marti J Anderson. A new method for non-parametric multivariate analysis of variance. Austral ecology, 26(1):32–46, 2001.
[11] Dennis Dobler, Sarah Friedrich, and Markus Pauly. Nonparametric manova in meaningful effects. Annals of the Institute of Statistical Mathematics, pages 1–26, 2019.
[12] Ruth Heller, Yair Heller, Shachar Kaufman, Barak Brill, and Malka Gorfine. Consistent distribution-free k-sample and independence tests for univariate random variables. The Journal of Machine Learning Research, 17(1):978–1031, 2016.
[13] Maria L Rizzo, Gábor J Székely, et al. Disco analysis: A nonparametric extension of analysis of variance. The Annals of Applied Statistics, 4(2):1034–1055, 2010.
[14] Holmes Finch. Comparison of the performance of nonparametric and parametric manova test statistics when assumptions are violated. Methodology, 1(1):27–38, 2005.
[15] Karl Pearson. Vii. note on regression and inheritance in the case of two parents. proceedings of the royal society of London, 58(347-352):240–242, 1895.
[16] Gábor J Székely, Maria L Rizzo, et al. Brownian distance covariance. The annals of applied statistics, 3(4):1236–1265, 2009.
[17] Gábor J Székely and Maria L Rizzo. The distance correlation t-test of independence in high dimension. Journal of Multivariate Analysis, 117:193–213, 2013.
[18] Gábor J Székely, Maria L Rizzo, Nail K Bakirov, et al. Measuring and testing dependence by correlation of distances. The annals of statistics, 35(6):2769–2794, 2007.
[19] Russell Lyons et al. Distance covariance in metric spaces. The Annals of Probability, 41(5):3284–3305, 2013.
[20] Arthur Gretton and László Györfi. Consistent nonparametric tests of independence. Journal of Machine Learning Research, 11(Apr):1391–1423, 2010.
[21] Arthur Gretton, Ralf Herbrich, Alexander Smola, Olivier Bousquet, and Bernhard Schölkopf. Kernel methods for measuring independence. Journal of Machine Learning Research, 6(Dec):2075–2129, 2005.
[22] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Bernhard Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning, 10(1-2):1–141, 2017.
[23] Dino Sejdinovic, Bharath Sriperumbudur, Arthur Gretton, and Kenji Fukumizu. Equivalence of distance-based and rkhs-based statistics in hypothesis testing. The Annals of Statistics, 41(5):2263–2291, 2013.
[24] Cencheng Shen and Joshua T Vogelstein. The exact equivalence of distance and kernel methods in hypothesis testing. AStA Advances in Statistical Analysis, 2020.
[25] Sambit Panda, Satish Palaniappan, Junhao Xiong, Eric W. Bridgeford, Ronak Mehta, Cencheng Shen, and Joshua T. Vogelstein. hyppo: A comprehensive multivariate hypothesis testing python package, 2019.
[26] Cencheng Shen, Carey E Priebe, and Joshua T Vogelstein. From distance correlation to multiscale graph correlation. Journal of the American Statistical Association, 115(529):280–291, 2020.
[27] Youjin Lee, Cencheng Shen, Carey E Priebe, and Joshua T Vogelstein. Network dependence testing via diffusion maps and distance-based correlations. Biometrika, 106(4):857–873, 2019.
[28] Joshua T Vogelstein, Eric W Bridgeford, Qing Wang, Carey E Priebe, Mauro Maggioni, and Cencheng Shen. Discovering and deciphering relationships across disparate data modalities. eLife, 8:e41690, 2019.
[29] Cencheng Shen, Sambit Panda, and Joshua T. Vogelstein. Learning interpretable characteristic kernels via decision forests, 2020.
[30] Dave S Collingridge. A primer on quantitized data analysis and permutation testing. Journal of Mixed Methods Research, 7(1):81–97, 2013.
[31] Meyer Dwass. Modified tests for nonparametric hypotheses. The Annals of Mathematical Statistics, pages 181–187, 1957.
[32] Phillip I Good. Permutation, parametric and bootstrap tests of hypotheses: a practical guide to resampling methods for testing hypotheses. Permutation, parametric and bootstrap tests of hypotheses: a practical guide to resampling methods for testing hypotheses, 100(4), 2005.
[33] Theodore Micceri. The unicorn, the normal curve, and other improbable creatures. Psychological bulletin, 105(1):156, 1989.
[34] Stephen M Stigler. Do robust estimators work with real data? The Annals of Statistics, pages 1055–1098, 1977.
[35] Gregory Carey. Multivariate analysis of variance (manova): I. theory. Retrieved May, 14:2011, 1998.
[36] AC Rencher. Methods of multivariate analysis. DOI, 10(0471271357):66, 2002.
[37] Maurice S Bartlett. A note on tests of significance in multivariate analysis. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 35, pages 180–185. Cambridge University Press, 1939.
[38] C Radhakrishna Rao. Tests of significance in multivariate analysis. Biometrika, 35(1/2):58–79, 1948.
[39] G David Garson. Multivariate glm, manova, and mancova. Statnotes: Topics in multivariate analysis, 2009.
[40] Chester L Olson. On choosing a test statistic in multivariate analysis of variance. Psychological bulletin, 83(4):579, 1976.
[41] Chester Lewellyn Olson. A Monte Carlo investigation of the robustness of multivariate analysis of variance. PhD thesis, Thesis (Ph. D.)–University of Toronto, 1973.
[42] Gábor J Székely, Maria L Rizzo, et al. Partial distance correlation with methods for dissimilarities. The Annals of Statistics, 42(6):2382–2412, 2014.
[43] Ronald Aylmer Fisher. Statistical methods for research workers. In Breakthroughs in statistics, pages 66–70. Springer, 1992.
[44] Cyrus R Mehta and Nitin R Patel. Exact inference for categorical data. Encyclopedia of biostatistics, 2:1411–1422, 1998.
[45] Peter J Bickel and Kjell A Doksum. : basic ideas and selected topics, volume I, volume 117. CRC Press, 2015.
[46] Anderson M Winkler, Matthew A Webster, Diego Vidaurre, Thomas E Nichols, and Stephen M Smith. Multi-level block permutation. Neuroimage, 123:253–268, 2015.
[47] David N Reshef, Yakir A Reshef, Hilary K Finucane, Sharon R Grossman, Gilean McVean, Peter J Turnbaugh, Eric S Lander, Michael Mitzenmacher, and Pardis C Sabeti. Detecting novel associations in large data sets. science, 334(6062):1518–1524, 2011.
[48] Malka Gorfine, Ruth Heller, and Yair Heller. Comment on detecting novel associations in large data sets. Unpublished (available at http://emotion. technion. ac. il/˜ gorfinm/filesscience6. pdf on 11 Nov. 2012), 2012.
[49] Oscar Esteban, Christopher J. Markiewicz, Ross W. Blair, Craig A. Moodie, A. Ilkay Isik, Asier Erramuzpe, James D. Kent, Mathias Goncalves, Elizabeth DuPre, Madeleine Snyder, Hiroyuki Oya, Satrajit S. Ghosh, Jessey Wright, Joke Durnez, Russell A. Poldrack, and Krzysztof J. Gorgolewski. fMRIPrep: a robust preprocessing pipeline for functional MRI. Nature Methods, 16(1):111–116, January 2019. ISSN 1548-7105. doi: 10.1038/s41592-018-0235-4. URL https://www.nature.com/articles/s41592-018-0235-4. Number: 1 Publisher: Nature Publishing Group.
[50] Alexandre Abraham, Fabian Pedregosa, Michael Eickenberg, Philippe Gervais, Andreas Mueller, Jean Kossaifi, Alexandre Gramfort, Bertrand Thirion, and Gael Varoquaux. Machine learning for neuroimaging with scikit-learn. Frontiers in Neuroinformatics, 8, 2014. ISSN 1662-5196. doi: 10.3389/fninf.2014.00014. URL https://www.frontiersin.org/articles/10.3389/fninf.2014.00014/full. Publisher: Frontiers.
[51] J. R. Kettenring. Canonical Analysis of Several Sets of Variables. Biometrika, 58(3):433–451, 1971. ISSN 0006-3444. doi: 10.2307/2334380. URL https://www.jstor.org/stable/2334380. Publisher: [Oxford University Press, Biometrika Trust].
[52] Babak Afshin-Pour, Gholam-Ali Hossein-Zadeh, Stephen C. Strother, and Hamid Soltanian-Zadeh. Enhancing reproducibility of fMRI statistical maps using generalized canonical correlation analysis in NPAIRS framework. NeuroImage, 60(4):1970–1981, May 2012. ISSN 1095-9572. doi: 10.1016/j.neuroimage.2012.01.137.
[53] V. D. Calhoun, T. Adali, G. D. Pearlson, and J. J. Pekar. A method for making group inferences from functional MRI data using independent component analysis. Human Brain Mapping, 14(3):140–151, 2001. ISSN 1097-0193. doi: 10.1002/hbm.1048. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/hbm.1048. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/hbm.1048.
[54] Ronan Perry, Gavin Mischler, Richard Guo, Theo Lee, Alexander Chang, Arman Koul, Cameron Franz, and Joshua T. Vogelstein. mvlearn: Multiview Machine Learning in Python. arXiv:2005.11890 [cs, stat], May 2020. URL http://arxiv.org/abs/2005.11890. arXiv: 2005.11890.
[55] C. Bonferroni. Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commericiali di Firenze, 8:3–62, 1936. URL https://ci.nii.ac.jp/naid/20001561442.
