arXiv:1910.08883v3 [stat.ML] 2 Apr 2021
Nonpar MANOVA via Independence Testing

Sambit Panda1,2, Cencheng Shen3, Ronan Perry1, Jelle Zorn4, Antoine Lutz4, Carey E. Priebe5, and Joshua T. Vogelstein1,2,6∗

Abstract. The k-sample testing problem tests whether or not k groups of data points are sampled from the same distribution. Multivariate analysis of variance (Manova) is currently the gold standard for k-sample testing but makes strong, often inappropriate, parametric assumptions. Moreover, independence testing and k-sample testing are tightly related, and there are many nonparametric multivariate independence tests with strong theoretical and empirical properties, including distance correlation (Dcorr) and Hilbert-Schmidt-Independence-Criterion (Hsic). We prove that universally consistent independence tests achieve universally consistent k-sample testing, and that k-sample statistics like Energy and Maximum Mean Discrepancy (Mmd) are exactly equivalent to Dcorr. Empirically evaluating these tests for k-sample scenarios demonstrates that these nonparametric independence tests typically outperform Manova, even for Gaussian distributed settings. Finally, we extend these nonparametric k-sample testing procedures to perform multiway and multilevel tests. Thus, we illustrate the existence of many theoretically motivated and empirically performant k-sample tests. A Python package with all independence and k-sample tests called hyppo is available from https://hyppo.neurodata.io/.

1 Introduction

A fundamental problem in statistics is the k-sample testing problem. Consider the two-sample problem: we obtain two datasets $u_i \in \mathbb{R}^p$ for $i = 1, \ldots, n$ and $v_j \in \mathbb{R}^p$ for $j = 1, \ldots, m$. Assume each $u_i$ is sampled independently and identically (i.i.d.) from $F_U$ and that each $v_j$ is sampled i.i.d. from $F_V$ (and also that each $u_i$ and each $v_j$ is independent from one another).
The two-sample testing problem tests whether the two datasets were sampled from the same distribution, that is,

$$
\begin{aligned}
H_0 &: F_U = F_V, \\
H_A &: F_U \neq F_V.
\end{aligned}
\tag{1.1}
$$

Eq. (1.1) can also be generalized to k samples: let $u_i^j \in \mathbb{R}^p$ for $j = 1, \ldots, k$ and $i = 1, \ldots, n_j$ be k datasets that are sampled i.i.d. from $F_1, \ldots, F_k$ and independently from one another. Then,

$$
\begin{aligned}
H_0 &: F_1 = F_2 = \cdots = F_k, \\
H_A &: \exists\, j \neq j' \text{ s.t. } F_j \neq F_{j'}.
\end{aligned}
\tag{1.2}
$$

To approach the problem of two-sample testing, Student's t-test [1] and its multivariate generalization Hotelling's $T^2$ [2] are traditionally used, while a few nonparametric alternatives that operate well on multivariate, nonlinear data have been proposed, such as Energy [3], maximum mean discrepancy (Mmd) [4], and Heller, Heller, and Gorfine's test [5]. The two-sample testing problem can be generalized to the k-sample testing problem, where analysis of variance (Anova) [6] or its multivariate analogue, multivariate Anova (Manova) [7], can be used, but these statistics either fail on or perform poorly on multivariate and nonlinear data. Also, Anova and Manova in particular rely on fundamental assumptions that generally do not hold in real data [8, 9]. Recently, a few nonparametric alternatives to Manova [10, 11] have been proposed, such as multivariate k-sample Heller Heller Gorfine [12] and distance components (Disco) [13]. Nonparametric tests similar to Manova are desirable, especially when the assumptions of Manova are not met [14].

A closely related problem to the k-sample testing problem is the independence testing problem. It is framed as follows: given $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}^q$, we observe n samples $(x_i, y_i) \overset{iid}{\sim} F_{XY}$. The two random variables X and Y are independent if and only if $F_{XY} = F_X F_Y$. So, the independence testing problem can be stated as

$$
\begin{aligned}
H_0 &: F_{XY} = F_X F_Y, \\
H_A &: F_{XY} \neq F_X F_Y.
\end{aligned}
\tag{1.3}
$$

∗ Sambit Panda and Cencheng Shen contributed equally to this work. Corresponding author: [email protected].
1 Department of Biomedical Engineering, Johns Hopkins University; 2 Institute for Computational Medicine, Johns Hopkins University; 3 Department of Applied Economics and Statistics, University of Delaware; 4 Lyon Neuroscience Research Centre, Lyon 1 University; 5 Department of Applied Mathematics and Statistics, Johns Hopkins University; 6 Center for Imaging Science, Kavli Neuroscience Discovery Institute, Johns Hopkins University; Progressive Learning

Many correlation measures have been proposed to approach the problem laid out in Eq. (1.3), such as Pearson's correlation [15]. But as with k-sample tests, many are unsuited to detect nonlinear and high-dimensional dependence structures within data. Recently, several statistics have been proposed that operate well on high-dimensional (potentially non-Euclidean) data, such as distance correlation (Dcorr) [16–19] and Hilbert-Schmidt independence criterion (Hsic) [20–22], which were shown to be equivalent formulations by Sejdinovic et al. [23] and Shen and Vogelstein [24]. Heller, Heller, and Gorfine proposed another nonparametric independence test (Hhg) with particularly high power in certain nonlinear relationships [5]. Multiscale Graph Correlation (Mgc) has demonstrated higher statistical power on many multivariate, nonlinear, and structured data when compared to other independence tests [25–27]. Mgc is statistically efficient, requiring about one-half to one-third of the number of samples to achieve the same statistical power as other approaches [28]. Furthermore, Kernel Mean Embedding Random Forest (Kmerf), which uses Dcorr and an induced kernel similarity matrix from random forest, has been shown to have even larger gains in power [29]. For each of these tests, p-values can be calculated using a random permutation test [30–32].
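To make the permutation procedure concrete, the following is a minimal sketch in Python, assuming NumPy and the biased sample version of distance correlation with Euclidean distances. The function names (`pairwise_distances`, `double_center`, `dcorr`, `permutation_pvalue`) and the simulated data are illustrative assumptions; this is not the hyppo implementation, which may differ in statistic, bias correction, and p-value computation.

```python
import numpy as np


def pairwise_distances(z):
    """Euclidean distance matrix between the rows of z (shape n x p)."""
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def double_center(d):
    """Subtract row and column means of d and add back the grand mean."""
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def dcorr(x, y):
    """Biased sample distance correlation between paired samples x and y."""
    a = double_center(pairwise_distances(x))
    b = double_center(pairwise_distances(y))
    dcov2 = (a * b).mean()
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    return 0.0 if denom == 0 else float(np.sqrt(max(dcov2, 0.0) / denom))


def permutation_pvalue(x, y, stat=dcorr, reps=200, seed=0):
    """Estimate a p-value by permuting the rows of y to break the pairing."""
    rng = np.random.default_rng(seed)
    observed = stat(x, y)
    null = np.array([stat(x, y[rng.permutation(len(y))]) for _ in range(reps)])
    # Add-one correction so the estimated p-value is never exactly zero.
    pvalue = (1 + np.sum(null >= observed)) / (1 + reps)
    return observed, pvalue


# Hypothetical data: a nonlinear (quadratic) dependence and an independent pair.
rng = np.random.default_rng(0)
x = rng.normal(size=(60, 1))
y_dep = x ** 2 + 0.1 * rng.normal(size=(60, 1))
y_ind = rng.normal(size=(60, 1))

stat_dep, p_dep = permutation_pvalue(x, y_dep)
stat_ind, p_ind = permutation_pvalue(x, y_ind)
print(f"dependent:   stat={stat_dep:.3f}, p={p_dep:.3f}")
print(f"independent: stat={stat_ind:.3f}, p={p_ind:.3f}")
```

Permuting only the rows of y preserves both marginal distributions while destroying any dependence, so the permuted statistics approximate the null distribution under Eq. (1.3). The sketch recomputes the distance matrices on every permutation for clarity; a practical implementation would permute a precomputed distance matrix instead.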
We prove that independence tests can be used for consistent k-sample testing and that the Energy and Mmd methods are equivalent to Dcorr and Hsic, and we empirically evaluate state-of-the-art independence tests as k-sample tests. When compared to existing k-sample tests over a suite of linear and nonlinear simulations, we demonstrate that these independence tests, and specifically Kmerf and Mgc, have higher statistical power than the alternatives in nearly all settings. All the k-sample tests are provided in the hyppo statistical package [25].

2 Preliminaries

2.1 Notation

Let $\mathbb{R}$ denote the real line $(-\infty, \infty)$. Let $F_X$, $F_Y$, and $F_{XY}$ refer to the marginal and joint distributions of random variables X and Y, respectively. Let $x$ and $y$ refer to samples from $F_X$ and $F_Y$, and let $\mathbf{x} \in \mathbb{R}^{n \times p}$ and $\mathbf{y} \in \mathbb{R}^{m \times p}$ refer to the matrices of observations of $x$ and $y$, respectively; that is, $\mathbf{x} = \{x_1, \ldots, x_n\}$ and $\mathbf{y} = \{y_1, \ldots, y_m\}$. The trace of an $n \times n$ square matrix is the sum of the elements along the main diagonal: $\operatorname{tr}(\mathbf{x}) = \sum_{i=1}^{n} x_{ii}$.

Evaluating the performance of tests requires defining metrics of effectiveness across various sample sizes and dimensions. The testing power of a level-$\alpha$ test ($\alpha$ being the Type 1 error level) is the probability of correctly rejecting the null hypothesis when the alternative is true. For a test to be consistent, its statistical power must converge to unity as the sample size increases to $\infty$.

2.2 Hotelling

Hotelling is a generalization of Student's t-test in arbitrary dimension [2]. Consider input samples $u_i \overset{iid}{\sim} F_U$ for $i \in \{1, \ldots, n\}$ and $v_i \overset{iid}{\sim} F_V$ for $i \in \{1, \ldots, m\}$. Let $\bar{u}$ refer to the columnwise means of $\mathbf{u}$; that is, $\bar{u} = (1/n) \sum_{i=1}^{n} u_i$, and let $\bar{v}$ be the same for $\mathbf{v}$. Calculate sample covariance matrices $\hat{\Sigma}_{uv} = \mathbf{u}^T \mathbf{v}$ and sample variance matrices $\hat{\Sigma}_{uu} = \mathbf{u}^T \mathbf{u}$ and $\hat{\Sigma}_{vv} = \mathbf{v}^T \mathbf{v}$.
Denote the pooled covariance matrix $\hat{\Sigma}$ as

$$
\hat{\Sigma} = \frac{(n - 1)\hat{\Sigma}_{uu} + (m - 1)\hat{\Sigma}_{vv}}{n + m - 2}.
$$

Then,

$$
\mathrm{Hotelling}_{n,m}(\mathbf{u}, \mathbf{v}) = \frac{nm}{n + m} (\bar{u} - \bar{v})^T \hat{\Sigma}^{-1} (\bar{u} - \bar{v}).
\tag{2.1}
$$

Since it is a multivariate generalization of Student's t-test, it relies on some of the same assumptions as Student's t-test. That is, the validity of Hotelling depends on the assumption that random variables are normally distributed within each group, each with the same covariance matrix. Distributions of input data are generally not known and cannot always be reasonably modeled as Gaussian [33, 34], and having the same covariance across groups is also generally not true of real data.

2.3 Manova

Manova is a procedure for comparing more than two multivariate samples [8, 35]. It is a multivariate generalization of the univariate Anova [8], using covariance matrices rather than scalar variances. As in Rencher [36]: consider input samples $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$ that have the same dimensionality p. Each $\mathbf{x}_i$, where $i \in \{1, \ldots, k\}$, is assumed to be sampled from a multivariate distribution $\mathcal{N}(\mu_i, \Sigma)$, and so each sample is assumed to have the same covariance matrix $\Sigma$. The model for each p-dimensional vector of each $\mathbf{x}_i$ is defined as follows: for $j \in \{1, \ldots, n_i\}$,

$$
x_{ij} = \mu_i + \epsilon_{ij}.
$$

In Manova, we are testing if the mean vectors of each of the k samples are the same. That is, the null and alternate hypotheses are

$$
\begin{aligned}
H_0 &: \mu_1 = \mu_2 = \cdots = \mu_k, \\
H_A &: \exists\, j \neq j' \text{ s.t. } \mu_j \neq \mu_{j'}.
\end{aligned}
$$

Let $\bar{x}_{i\cdot}$ refer to the columnwise means of $\mathbf{x}_i$; that is, $\bar{x}_{i\cdot} = (1/n_i) \sum_{j=1}^{n_i} x_{ij}$. The pooled sample covariance of each group, $W$, is

$$
W = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i\cdot})(x_{ij} - \bar{x}_{i\cdot})^T.
\tag{2.2}
$$

Next, define $B$ as the sample covariance matrix of the means.
If $n = \sum_{i=1}^{k} n_i$ and the grand mean is $\bar{x}_{\cdot\cdot} = (1/n) \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij}$, then

$$
B = \sum_{i=1}^{k} n_i (\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot})(\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot})^T.
\tag{2.3}
$$

Some of the most common statistics used when performing Manova include Wilks' Lambda, the Lawley-Hotelling trace, Roy's greatest root, and the Pillai-Bartlett trace (PBT) [37–39]. PBT is recognized to be the best of these as it is the most conservative [8, 40], and Olson [41] has shown that there are minimal differences in statistical power among these statistics.
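The matrices $W$ and $B$ of Eqs. (2.2) and (2.3) and the Hotelling statistic of Eq. (2.1) can be computed directly, as in the following minimal NumPy sketch. Two points are assumptions rather than statements from the text above: the Pillai-Bartlett trace is taken in its standard form $\operatorname{tr}(B (W + B)^{-1})$, and the Hotelling function uses mean-centered sample covariances (via `np.cov`) in the pooled estimate. The function names and simulated groups are hypothetical, and no p-value is computed.

```python
import numpy as np


def manova_matrices(groups):
    """Return W (Eq. 2.2) and B (Eq. 2.3) for a list of (n_i x p) arrays."""
    grand_mean = np.vstack(groups).mean(axis=0)
    p = grand_mean.shape[0]
    W = np.zeros((p, p))
    B = np.zeros((p, p))
    for g in groups:
        group_mean = g.mean(axis=0)
        centered = g - group_mean
        W += centered.T @ centered  # within-group scatter
        d = (group_mean - grand_mean)[:, None]
        B += len(g) * (d @ d.T)  # between-group scatter
    return W, B


def pillai_trace(groups):
    """Pillai-Bartlett trace, assumed here to be tr(B (W + B)^{-1})."""
    W, B = manova_matrices(groups)
    # tr((W + B)^{-1} B) equals tr(B (W + B)^{-1}) by cyclicity of the trace.
    return float(np.trace(np.linalg.solve(W + B, B)))


def hotelling_t2(u, v):
    """Two-sample Hotelling statistic of Eq. (2.1).

    Assumption: the group covariances are the mean-centered sample
    covariances from np.cov, pooled with weights (n - 1) and (m - 1).
    """
    n, m = len(u), len(v)
    diff = u.mean(axis=0) - v.mean(axis=0)
    cov_u = np.atleast_2d(np.cov(u, rowvar=False))
    cov_v = np.atleast_2d(np.cov(v, rowvar=False))
    pooled = ((n - 1) * cov_u + (m - 1) * cov_v) / (n + m - 2)
    return float(n * m / (n + m) * diff @ np.linalg.solve(pooled, diff))


# Hypothetical data: three Gaussian groups in R^3, the last with a shifted mean.
rng = np.random.default_rng(0)
groups = [
    rng.normal(size=(30, 3)),
    rng.normal(size=(30, 3)),
    rng.normal(loc=1.0, size=(30, 3)),
]

pillai = pillai_trace(groups)
t2 = hotelling_t2(groups[0], groups[2])
print(f"Pillai-Bartlett trace: {pillai:.3f}")
print(f"Hotelling statistic (group 1 vs. group 3): {t2:.3f}")
```

Solving the linear system with `np.linalg.solve` avoids forming an explicit matrix inverse, which is generally the more numerically stable choice when $W + B$ or the pooled covariance is poorly conditioned.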