An Efficient and Distribution-Free Two-Sample Test Based on Energy Statistics and Random Projections
Cheng Huang and Xiaoming Huo

July 14, 2017

arXiv:1707.04602v1 [stat.ME] 14 Jul 2017

Abstract

A common disadvantage of existing distribution-free two-sample testing approaches is that their computational complexity can be high. Specifically, if the sample size is $N$, the computational complexity of those two-sample tests is at least $O(N^2)$. In this paper, we develop an efficient algorithm with complexity $O(N \log N)$ for computing energy statistics in the univariate case. For the multivariate case, we introduce a two-sample test based on energy statistics and random projections, which enjoys $O(KN \log N)$ computational complexity, where $K$ is the number of random projections. We name our method for the multivariate case Randomly Projected Energy Statistics (RPES). We show, both theoretically and empirically, that RPES achieves nearly the same test power as energy statistics. Numerical experiments also demonstrate the efficiency of the proposed method over its competitors.

1 Introduction

Testing the equality of distributions is one of the most fundamental problems in statistics. Formally, let $F$ and $G$ denote two distribution functions in $\mathbb{R}^p$. Given independent and identically distributed samples $\{X_1, \ldots, X_n\}$ and $\{Y_1, \ldots, Y_m\}$ from the two unknown distributions $F$ and $G$, respectively, the two-sample testing problem is to test the hypotheses

    H_0 : F = G \quad \text{vs.} \quad H_1 : F \neq G.

There are a few recent advances in two-sample testing that have attracted attention in the statistics and machine learning communities. [17] propose a test statistic based on optimal non-bipartite matching, and [3] develop a test based on the shortest Hamiltonian path, both of which are distribution-free. [8] develop a kernel method based on maximum mean discrepancy. [21], [22] and [1] consider test statistics based on pairwise distances within the same sample and across the two samples, which also motivates this work.
Computational complexity is a common limitation of the aforementioned methods. Let $N = n + m$ denote the size of the two-sample testing problem. The Cross Match (CM) test in [17] requires solving the non-bipartite matching problem, whose computational complexity is (1) $O(N^3)$ with the optimal solution, see [7], or (2) $O(N^2)$ with a greedy heuristic. The two-sample test in [3] is based on the shortest Hamiltonian path, which is an NP-complete problem; its computational complexity is $O(N^2 \log N)$ with a heuristic method based on Kruskal's algorithm ([13]). The Maximum Mean Discrepancy (MMD) proposed by [8] requires computing the kernel function values of all pairs of samples, whose complexity is $O(N^2)$. Similarly, the energy statistics based methods in [21] and [1] typically require all pairwise Euclidean distances, which also costs $O(N^2)$.

In summary, the computational complexity of the aforementioned two-sample tests is at least $O(N^2)$, which leads to substantial computing time and prohibits their use when the sample size $N$ is large. As a solution, we develop an efficient algorithm for computing the energy statistics in [21] with complexity $O(N \log N)$ for univariate random variables. For multivariate random variables, we propose an efficient algorithm of complexity $O(KN \log N)$ using the technique of random projection, where $K$ is the number of random projections. The main idea of the multivariate algorithm is as follows: first, we project the data along a random direction; then, we use the fast univariate algorithm to compute the energy statistic of the projected data; lastly, we repeat the previous procedure multiple times and take the average. As we will show in Theorem 4.12, the proposed test statistic based on random projections has nearly the same power as energy statistics. The technique of random projection has been widely used in two-sample testing problems.
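The project-then-average procedure just described can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are ours, and the univariate subroutine below is the naive $O(N^2)$ one, whereas the actual method replaces it with the $O(N \log N)$ univariate algorithm developed later in the paper.

```python
import numpy as np

def _energy_stat_1d(x, y):
    # Naive O(N^2) unbiased univariate energy statistic; the paper's method
    # replaces this step with an O(N log N) sorting-based algorithm.
    n, m = len(x), len(y)
    cross = np.abs(x[:, None] - y[None, :]).sum()
    within_x = np.abs(x[:, None] - x[None, :]).sum()  # ordered pairs, i != j
    within_y = np.abs(y[:, None] - y[None, :]).sum()
    return (2 * cross / (n * m)
            - within_x / (n * (n - 1))
            - within_y / (m * (m - 1)))

def rpes(x, y, n_proj=50, seed=0):
    """Randomly projected energy statistic (sketch): average the univariate
    energy statistic of both samples projected onto random unit directions."""
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_proj):
        u = rng.normal(size=x.shape[1])
        u /= np.linalg.norm(u)          # uniform random direction on the sphere
        stats.append(_energy_stat_1d(x @ u, y @ u))
    return float(np.mean(stats))
```

With samples drawn from the same distribution, the averaged statistic fluctuates around zero; a clear distributional difference, such as a mean shift, drives it well above zero.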
[15] propose a new method for testing the equality of means across samples, which first projects the data along a few random directions and then applies the classical Hotelling $T^2$ statistic. [20] develop a similar approach based on random projection and the Hotelling $T^2$ statistic, but the random projection is applied to the sample mean vectors and sample covariance matrices. These two papers focus on the problem under multivariate Gaussian settings, while our work is more general and does not impose any assumptions on the distributions.

The remainder of this paper is organized as follows. We review the definition and properties of energy distance and energy statistics in Section 2. In Section 3, we describe the details of the fast algorithms and the corresponding two-sample tests. Asymptotic properties of the proposed test statistic are studied in Section 4. In Section 5, we present numerical examples with simulated data to illustrate the computational and statistical efficiency of the proposed test. Discussions can be found in Section 6, and we conclude in Section 7.

Throughout this paper, we adopt the following notations. We denote the two constants

    c_p = \frac{\pi^{(p+1)/2}}{\Gamma((p+1)/2)} \quad \text{and} \quad C_p = \frac{c_{p-1}}{c_p} = \frac{\Gamma((p+1)/2)}{\sqrt{\pi}\, \Gamma(p/2)},

where $\Gamma(\cdot)$ denotes the Gamma function. We also denote $|\cdot|$ as the Euclidean norm. For any vector $v$, $v^T$ is its transpose.

2 Review of Energy Distance and Energy Statistics

Energy distance was initially proposed by [21] to measure the distance between two multivariate distributions. We follow the definition of energy distance in [23].

Definition 2.1. [23, Definition 1] Suppose $X, Y \in \mathbb{R}^p$ are two independent random variables with finite means, i.e., $E[|X|] < \infty$ and $E[|Y|] < \infty$. The energy distance between $X$ and $Y$ is defined as

    \mathcal{E}(X, Y) = 2E[|X - Y|] - E[|X - X'|] - E[|Y - Y'|],

where $X'$ and $Y'$ are independent and identical copies of $X$ and $Y$, respectively.
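As a quick numerical sanity check on the definition above (ours, not from the paper), the three expectations can be estimated by Monte Carlo. For identically distributed $X$ and $Y$ the estimate is near zero, and it becomes strictly positive under, e.g., a mean shift. The function name and the callable-sampler interface are illustrative.

```python
import numpy as np

def energy_distance_mc(draw_x, draw_y, n=200_000, seed=0):
    """Monte Carlo estimate of E(X, Y) = 2 E|X - Y| - E|X - X'| - E|Y - Y'|.

    draw_x / draw_y are callables (rng, n) -> array of n i.i.d. draws.
    """
    rng = np.random.default_rng(seed)
    x, x2 = draw_x(rng, n), draw_x(rng, n)   # X and an independent copy X'
    y, y2 = draw_y(rng, n), draw_y(rng, n)   # Y and an independent copy Y'
    return (2 * np.abs(x - y).mean()
            - np.abs(x - x2).mean()
            - np.abs(y - y2).mean())
```

For instance, with $X, Y \sim N(0, 1)$ the estimate is close to $0$, consistent with $\mathcal{E}(X, Y) = 0$ for identical distributions, while $X \sim N(0,1)$ versus $Y \sim N(1,1)$ gives a clearly positive value (the exact energy distance here is about $0.54$).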
[23] also show that energy distance is equivalent to a weighted $L_2$-distance between characteristic functions.

Proposition 2.2. [23, Proposition 1] Suppose $X, Y \in \mathbb{R}^p$ are two independent random variables with finite means, and $X'$ and $Y'$ are independent and identical copies of $X$ and $Y$. Let $\tilde{f}_X(\cdot)$ and $\tilde{f}_Y(\cdot)$ denote the characteristic functions of $X$ and $Y$, respectively. Then

    \frac{1}{c_p} \int_{\mathbb{R}^p} \frac{|\tilde{f}_X(t) - \tilde{f}_Y(t)|^2}{|t|^{p+1}} \, dt = 2E[|X - Y|] - E[|X - X'|] - E[|Y - Y'|] = \mathcal{E}(X, Y).

Thus, $\mathcal{E}(X, Y) \ge 0$, with equality to zero if and only if $X$ and $Y$ are identically distributed, i.e., $\tilde{f}_X \equiv \tilde{f}_Y$.

Suppose that we observe samples $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} F$ and $Y_1, \ldots, Y_m \overset{\text{i.i.d.}}{\sim} G$. The energy statistic is usually defined as follows (see [23], (6.1)):

    \frac{2}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} |X_i - Y_j| - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} |X_i - X_j| - \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} |Y_i - Y_j|.

However, the above estimator is NOT an unbiased estimator of $\mathcal{E}(X, Y)$. To mitigate this issue, let

    h(X_1, X_2, Y_1, Y_2) = \frac{1}{2} \sum_{i=1}^{2} \sum_{j=1}^{2} |X_i - Y_j| - |X_1 - X_2| - |Y_1 - Y_2|

be a two-sample kernel (see [24, Chapter 12.2]), which is an unbiased estimator, i.e., $E[h] = \mathcal{E}(X, Y)$. Then it is easy to verify that

    \frac{1}{\binom{n}{2} \binom{m}{2}} \sum_{i_1 < i_2,\, j_1 < j_2} h(X_{i_1}, X_{i_2}, Y_{j_1}, Y_{j_2}) = \frac{2}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} |X_i - Y_j| - \frac{1}{n(n-1)} \sum_{i,j=1, i \neq j}^{n} |X_i - X_j| - \frac{1}{m(m-1)} \sum_{i,j=1, i \neq j}^{m} |Y_i - Y_j|

is a U-statistic and an unbiased estimator of $\mathcal{E}(X, Y)$. Thus, we will use the following definition of energy statistics throughout this paper.

Definition 2.3 (Unbiased Energy Statistics). Given samples $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} F$ and $Y_1, \ldots, Y_m \overset{\text{i.i.d.}}{\sim} G$, the energy statistic between $X$ and $Y$ is defined as

    \mathcal{E}_{n,m}(X, Y) = \frac{2}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} |X_i - Y_j| - \frac{1}{n(n-1)} \sum_{i,j=1, i \neq j}^{n} |X_i - X_j| - \frac{1}{m(m-1)} \sum_{i,j=1, i \neq j}^{m} |Y_i - Y_j|.

3 Efficient Computational Method for Energy Statistics

In this section, we describe efficient algorithms for computing the energy statistics of univariate and multivariate random variables in Section 3.1 and Section 3.2, respectively. We also propose, in Section 3.3, two different two-sample tests based on the efficient algorithm for multivariate random variables.

3.1 A Fast Algorithm for Univariate Random Variables

We start with the fast algorithm for univariate random variables. Let us first recall the definition of the energy statistic. Given univariate samples $X_1, \ldots, X_n \in \mathbb{R}$ and $Y_1, \ldots, Y_m \in \mathbb{R}$, the energy statistic of $X$ and $Y$ is

    \mathcal{E}_{n,m}(X, Y) = \frac{2}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} |X_i - Y_j| - \frac{1}{n(n-1)} \sum_{i,j=1, i \neq j}^{n} |X_i - X_j| - \frac{1}{m(m-1)} \sum_{i,j=1, i \neq j}^{m} |Y_i - Y_j|.

For simplicity of notation, we denote the above quantity by $\mathcal{E}_{n,m}$. The following algorithm computes $\mathcal{E}_{n,m}$ with an average complexity of $O(N \log N)$, where $N = n + m$. The main idea is to sort the observations first and then compute the energy statistic from the sorted observations with a linear-time pass.
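The sorting idea can be made concrete as follows. For sorted values $z_{(1)} \le \cdots \le z_{(n)}$, the sum of pairwise absolute differences collapses to a weighted sum, $\sum_{i<j} (z_{(j)} - z_{(i)}) = \sum_{j=1}^{n} (2j - n - 1)\, z_{(j)}$, and the cross-sample sum equals the pooled-sample pairwise sum minus the two within-sample sums. Below is a minimal Python sketch built on these two facts; it is our illustration of the sorting idea, not the paper's own pseudocode, which may differ in detail.

```python
import numpy as np

def pairwise_abs_sum(z):
    """Sum of |z_i - z_j| over all unordered pairs in O(n log n).

    After sorting, z_(j) (1-indexed) enters with a plus sign for the j - 1
    smaller values and a minus sign for the n - j larger ones, giving the
    coefficient 2j - n - 1 (equivalently 2j - n + 1 for 0-indexed j).
    """
    z = np.sort(np.asarray(z, dtype=float))
    n = len(z)
    coef = 2 * np.arange(n) - n + 1   # 0-indexed coefficients
    return float(coef @ z)

def energy_statistic_1d(x, y):
    """Unbiased univariate energy statistic (Definition 2.3) in O(N log N)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    s_x = pairwise_abs_sum(x)   # unordered within-X pairs
    s_y = pairwise_abs_sum(y)   # unordered within-Y pairs
    # Cross pairs = all pooled pairs minus the two within-sample parts.
    s_xy = pairwise_abs_sum(np.concatenate([x, y])) - s_x - s_y
    # The within-sample sums in Definition 2.3 run over ordered pairs i != j,
    # i.e. twice the unordered-pair sums, hence the factors of 2 below.
    return (2.0 / (n * m)) * s_xy \
        - (2.0 / (n * (n - 1))) * s_x \
        - (2.0 / (m * (m - 1))) * s_y
```

Sorting dominates the cost, so the whole computation is $O(N \log N)$; the result matches the direct $O(N^2)$ evaluation of the double sums exactly.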