Multivariate Tests of Association Based on Univariate Tests

Multivariate tests of association based on univariate tests Ruth Heller Yair Heller Department of Statistics and Operations Research [email protected] Tel-Aviv University Tel-Aviv, Israel 6997801 [email protected] Abstract For testing two vector random variables for independence, we propose testing whether the distance of one vector from an arbitrary center point is independent from the distance of the other vector from another arbitrary center point by a univariate test. We prove that under minimal assumptions, it is enough to have a consistent univariate independence test on the distances, to guarantee that the power to detect dependence between the random vectors increases to one with sample size. If the univariate test is distribution-free, the multivariate test will also be distribution-free. If we consider multiple center points and aggregate the center-specific univariate tests, the power may be further improved, and the resulting multivariate test may have a distribution-free critical value for specific aggregation methods (if the univariate test is distribution free). We show that certain multivariate tests recently proposed in the literature can be viewed as instances of this general approach. Moreover, we show in experiments that novel tests constructed using our approach can have better power and computational time than competing approaches. 1 Introduction Let X 2 <p and Y 2 <q be random vectors, where p and q are positive integers. The null hypothesis of independence is H0 : FXY = FX FY ; where the joint distribution of (X; Y ) is denoted by FXY , and the distributions of X and Y , respectively, by FX and FY . If X is a categorical variable with K categories, then the null hypothesis of independence is the null hypothesis in the K-sample problem, H0 : F1 = ::: = FK ; where Fk; k 2 f1;:::;Kg is the distribution of Y in category k. The problem of testing for independence of random vectors, as well as the K-sample problem on a multivariate Y , against the general alternative H1 : FXY 6= FX FY ; has received increased attention in recent years. The most common approach is based on pairwise distances or similarity measures. See (26), (6), (24), and (12) for consistent tests of independence, and (10), (25), (1), (22), (5), and (8) for recent K-sample tests. Earlier tests based on nearest neighbours include (23) and (13). For the K-sample problem, the practice of comparing multivariate distributions based on pairwise distances is justified by the fact that, under mild conditions, the distributions differ if and only if the distributions of within and between pairwise distances differ (19). Other innovative approaches have also been considered in recent years. In (4) and (28), the authors suggest to reduce the multivariate data to a lower dimensional sub-space by (random) projections. Recently, in (3) another approach was introduced for the two sample problem, which is based on distances between analytic functions representing each of the distributions. Their novel tests are almost surely consistent when randomly selecting locations or frequencies and are fast to compute. 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain. We suggest the following approach for testing for independence: first compute the distances from a fixed center point, then apply any univariate independence test on the distances. We show that this approach can result in novel powerful multivariate tests, that are attractive due to their theoretical guarantees and computational complexity. Specifically, in Section 2 we show that if H0 is false, then applying a univariate consistent test on distances from a single center point will result in a multivariate consistent test (except for a measure zero set of center points), where a consistent test is a test with power (i.e., probability of rejecting H0 when H0 is false) increasing to one as the sample size increases when H0 is false. Moreover, the computational time is that of the univariate test, which means that it can be very fast. In particular, a desirable requirement is that the null distribution of the test statistic does not depend on the marginal distributions of X and Y , i.e., that the test is distribution-free. Powerful univariate consistent distribution-free tests exist (see (11) for novel tests and a review), so if one of these distribution-free univariate test is applied on the distances, the resulting multivariate test is distribution-free. In Section 3 we show that considering the distances from M > 1 points and aggregating the resulting statistics can also result in consistent tests, which may be more powerful than tests that consider a single center point. Both distribution-free and permutation-based tests can be generated, depending on the choice of aggregation method and univariate test. In Section 4 we draw the connection between these results and some known tests mentioned above. The tests of (10) and of (12) can be viewed as instances of this approach, where the fixed center point is a sample point, and all sample points are considered each in turn as a fixed center point, for a particular univariate test. In Section 5 we demonstrate in simulations that novel tests based on our approach can have both a power advantage and a great computational advantage over existing multivariate tests. In Section 6 we discuss further extensions. 2 From multivariate to univariate d We use the following result by (21). Let Bd(x; r) = fy 2 < : kx − yk ≤ rg be a ball centered at x with radius r. A complex Radon measure µ, defined formally in Supplementary Material (SM) § D, on <d is said to be of at most exponential-quadratic growth if there exist positive constants A and α αr2 such that jµj(Bd(0; r)) ≤ Ae . Proposition 2.1 (Rawat and Sitaram (21)). Let Γ ⊂ <d be such that the only real analytic function (defined on an open set containing Γ) that vanishes on Γ, is the zero function. Let C = fBd(x; r): x 2 Γ; r > 0g. Then for any complex Radon measure µ on <d of at most exponential-quadratic growth, if µ(C) = 0 for all C 2 C, then it necessarily follows that µ = 0. q For the two-sample problem, let Y 2 < be a random variable with cumulative distribution F1 in q 0 category X = 1, and F2 in category X = 2. For z 2 < , let Fiz be the cumulative distribution function of kY − zk when Y has cumulative distribution Fi, i 2 f1; 2g. We show that if the distribution of Y differs across categories, then so does the distribution of the distance of Y from almost every point z. Therefore, any univariate consistent two-sample test on the distances from z results in a consistent test of the equality of the multivariate distributions F1 and F2, for almost every z. It is straightforward to generalize these results to K > 2 categories. Proofs of all Theorems are in SM § A. q Theorem 2.1. If H0 : F1 = F2 is false, then for every z 2 < , apart from at most a set of Lebesgue 0 0 measure 0, there exists an r > 0 such that F1z(r) 6= F2z(r). Corollary 2.1. For every z 2 <q, apart from at most a set of Lebesgue measure 0, a consistent 0 0 0 two-sample univariate test of the null hypothesis H0 : F1z = F2z will result in a multivariate consistent test of the null hypothesis H0 : F1 = F2. For the multivariate independence test, let X 2 Rp and Y 2 <q be two random vectors with marginal distributions FX and FY , respectively, and with joint distribution FXY . For z = (zx; zy); zx 2 p q 0 0 0 < ; zy 2 < , let FXY z be the joint distribution of (kX − zxk; kY − zyk). Let FXz and FY z be the marginal distribution of kX − zxk and kY − zyk, respectively. p q Theorem 2.2. If H0 : FXY = FX FY is false, then for every zx 2 < ; zy 2 < , apart from at most a 0 0 0 set of Lebesgue measure 0, there exists rx > 0; ry > 0, such that FXY z(rx; ry) 6= FXz(rx)FY z(ry). 2 Corollary 2.2. For every z 2 <p+q, apart from at most a set of Lebesgue measure 0, a consistent 0 0 0 0 univariate test of independence of the null hypothesis H0 : FXY z = FXzFY z will result in a multivariate consistent test of the null hypothesis H0 : FXY = FX FY . We have N independent copies (xi; yi) (i = 1;:::;N) from the joint distribution FXY . The above results motivate the following two-step procedure for the multivariate tests. For the K-sample test, q xi 2 f1;:::;Kg determines the category and yi 2 < is the observation in category xi, so the two-step procedure is to first choose z 2 <q and then to apply a univariate K-sample consistent test on (x1; ky1 − zk);:::; (xN ; kyN − zk). Examples of such univariate tests include the classic Kolmogorov-Smirnov and Cramer-von Mises tests. For the independence test, the two-step procedure p q is to first choose zx 2 < and zy 2 < , and then to apply a univariate consistent independence test on (kx1 − zxk; ky1 − zyk);:::; (kxN − zxk; kyN − zyk). An example of such a univariate test is the classic test of Hoeffding (14). Note that the consistency of a univariate test may be satisfied only under some assumptions on the distribution of the distances of the multivariate vectors. For example, the consistency of (14) follows if the densities of kX − zxk and kY − zyk are continuous. See (11) for additional distribution-free univariate K-sample and independence tests.

Load more