
SOME NONPARAMETRIC TESTS

1. Introduction

In parametric statistics, we may consider particular parametric families, such as the normal distributions in testing for equality of variances via F tests, or for equality of means via t tests or analysis of variance. In regression, the assumption of i.i.d. N(0, σ²) errors is used in testing whether regression coefficients are significantly different from 0. The Wilks test applies to more general families of distributions indexed by θ in a finite-dimensional space Θ. Similarly, the χ² test of composite hypotheses applies to fairly general parametric subfamilies of multinomial distributions.

In nonparametric statistics, there actually still are parameters in a sense, such as the median m or other quantiles, but we don't have distributions determined uniquely by such a parameter. Instead there are more general restrictions on the distribution function F of the observations, such as that F is continuous. So the families of possible distributions are infinite-dimensional. Given n observations X_1, ..., X_n, we can always form their order statistics X_(1) ≤ X_(2) ≤ ··· ≤ X_(n) and their empirical distribution function F_n(x) := (1/n) Σ_{j=1}^n 1_{X_j ≤ x}. Some tests can be based on these. This handout will consider two tests of whether two samples both came from the same (unknown) distribution, the Mann–Whitney–Wilcoxon rank-sum test and the Kolmogorov–Smirnov test. Also we will have the Wilcoxon "signed-rank" test of whether paired variables (X_j, Y_j) have the same distribution as (Y_j, X_j).

2. Symmetry of random variables

Two real random variables X and Y are said to have the same distribution, or to be equal in distribution, written X =^d Y, if for some F, Pr(X ≤ x) = Pr(Y ≤ x) = F(x) for all x. If X =^d Y, then for any constant c, X + c =^d Y + c. But one cannot necessarily add the same random variable to both sides of an equality in distribution. Let X and Y be i.i.d. N(0, 1). Then X =^d Y, but X + X ≠^d X + Y, because X + X = 2X is N(0, 4), but X + Y is N(0, 2).

A real random variable X is said to have a symmetric distribution (around 0), or to be symmetric, if X and −X have the same distribution.

Date: 18.650, Nov. 20, 2015.

Then, given a real number m, (the distribution of) X is said to be symmetric around m if X − m is symmetric around 0, i.e. X − m and m − X have the same distribution. Equivalently, X and 2m − X have the same distribution.

Examples. Any N(µ, σ²) distribution is symmetric around µ. Any t(d) distribution is symmetric around 0. A Beta(a, b) distribution is symmetric around m if and only if both m = 1/2 and a = b. If X has a density f with f(x) > 0 for all x > 0 and f(x) = 0 for all x ≤ 0, then X is not symmetric around any m. This applies, for example, to gamma distributions, including χ² and exponential distributions, and to F distributions.

Fact 1. Suppose a random variable X is symmetric around some m.
(a) Then m is a median of X.
(b) Actually m is the median of X, in the sense that if X has a non-degenerate interval of medians, m must be the midpoint of that interval.
(c) If E|X| < +∞, then also EX = m.

Proof. (a) Let X be symmetric around m. Then from the definitions,

Pr(X ≥ m) = Pr(X − m ≥ 0) = Pr(m − X ≥ 0) = Pr(X ≤ m).

Then 2 Pr(X ≤ m) = Pr(X ≤ m) + Pr(X ≥ m) = 1 + Pr(X = m) ≥ 1, and so Pr(X ≤ m) = Pr(X ≥ m) ≥ 1/2, and m is indeed a median of X.
(b) Suppose for some c ≠ 0, m + c is a median of X. Then c is a median of X − m, so it is also a median of m − X. It follows that −c is a median of X − m, so m − c is a median of X. So the interval of medians of X is symmetric around m, and m is the midpoint of it.
(c) E(X − m) = EX − m = E(m − X) = m − EX, so 2EX = 2m and EX = m.

Example. Consider the binomial(5, 1/2) distribution. Its distribution function F satisfies F(x) = 1/2 for 2 ≤ x < 3. So it has an interval [2, 3] of medians. The distribution is symmetric around 5/2 = 2.5, the median as the midpoint of the interval of medians, and also the mean.
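As a small numerical illustration (a sketch in R, with no new content beyond the example above), the symmetry around 2.5 and the value F(2) = 1/2 can be checked directly:

```r
# Sketch: check that binomial(5, 1/2) is symmetric around 2.5, i.e.
# Pr(X = k) = Pr(X = 5 - k), that F(2) = 1/2, and that the mean is 2.5.
k <- 0:5
rbind(k = k, pk = dbinom(k, 5, 0.5), p5mk = dbinom(5 - k, 5, 0.5))
pbinom(2, 5, 0.5)             # F(x) = 1/2 for 2 <= x < 3
sum(k * dbinom(k, 5, 0.5))    # mean = 2.5
```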

3. The Mann–Whitney–Wilcoxon rank-sum test

This is a test of whether two samples come from the same distribution, against the alternative that members of one sample tend to be larger than those of the other sample (a location or shift alternative). No parametric form of the distributions is assumed. They can be quite general, as long as the distribution functions are continuous. One might want to use such a test, called a nonparametric test, if, for example, the data have outliers and so appear not to be normally distributed. Rice considers this test in subsection 11.2.3, pp. 435–443.

The general assumption for the test is that real random variables X_1, ..., X_m are i.i.d. with a distribution function F, and independent of Y_1, ..., Y_n, which are i.i.d. with another distribution function G, with both F and G continuous. The hypothesis to be tested is H_0: F = G.

The test works as follows: let N = m + n and combine the samples of X's and Y's into a total sample Z_1, ..., Z_N. Arrange the Z_k in order (take their order statistics) to get Z_(1) < Z_(2) < ··· < Z_(N). With probability 1, no two of the order statistics are equal, because F and G are continuous. Let rank(V) = k if V = Z_(k), for V = X_i or Y_j. Let T_X = Σ_{i=1}^m rank(X_i). Then T_X will be the test statistic. H_0 will be rejected if either T_X is too small, indicating that the X's tend to be less than the Y's, or if T_X is too large, indicating that the Y's tend to be less than the X's.

To determine quantitatively what values are too small or too large, we need to look into the distribution of T_X under H_0. If H_0 holds then Z_1, ..., Z_N are i.i.d. with distribution F = G. Let E_0 be expectation, and Var_0 the variance, when the hypothesis H_0 is true. Let R_i be the rank of X_i. Then R_i has the discrete uniform distribution on {1, 2, ..., N}, Pr(R_i = k) = 1/N for k = 1, ..., N. This distribution has mean E_0 R_i = (1/N)·(N(N+1)/2) = (N+1)/2. A variable U with this distribution has

(1)   E(U²) = (1/N) Σ_{k=1}^N k² = (1/N) · N(N+1)(2N+1)/6 = (N+1)(2N+1)/6.

It follows that the variance Var_0(R_i) of the distribution is

(2)   (N+1)(2N+1)/6 − (N+1)²/4 = (4N² + 6N + 2 − 3N² − 6N − 3)/12 = (N² − 1)/12.

Recalling that the continuous U[a, b] distribution has variance (b − a)²/12, the 12 in the denominator is to be expected. Moreover, let X have the discrete uniform distribution on {1, ..., N}, as each R_i does. Let V have a U[−1/2, 1/2] distribution and be independent of X. Then X + V is easily seen to have a U[1/2, N + 1/2] distribution, so Var(X + V) = N²/12, while by independence, Var(X + V) = Var(X) + Var(V) = Var(X) + 1/12, so Var(X) = (N² − 1)/12, giving another proof of (2).

We know the mean and variance of each rank R_i for i = 1, ..., m. To find the mean and variance of the sum T_X = Σ_{i=1}^m R_i, the mean is easy, namely E_0 T_X = m(N + 1)/2. For the variance, we need to find the covariances of the ranks R_i and R_j for i ≠ j, all of which equal the covariance of R_1 and R_2. These two ranks are not independent because they cannot have the same value. First we find E_0(R_1 R_2). To make this easier we can express it as E_0(R_1 E_0(R_2 | R_1)). (Using the conditional expectation breaks the calculation into two easier ones.) We have

E_0(R_2 | R_1) = (1/(N−1)) (N(N+1)/2 − R_1),

because, given R_1, R_2 can have any of the N − 1 values in {1, 2, ..., N} other than R_1, each with probability 1/(N − 1). It follows by (1) that

E_0(R_1 R_2) = E_0(R_1 E_0(R_2 | R_1)) = (1/(N−1)) [ (N(N+1)/2)·((N+1)/2) − (2N² + 3N + 1)/6 ]

= (1/(N−1)) · (3N³ + 6N² + 3N − (4N² + 6N + 2))/12

= (1/(N−1)) · (3N³ + 2N² − 3N − 2)/12 = (3N + 2)(N² − 1)/(12(N − 1)) = (3N + 2)(N + 1)/12.

Thus under H_0 the covariance of R_1 and R_2 is

E_0(R_1 R_2) − (E_0 R_1)² = (3N² + 5N + 2)/12 − (N² + 2N + 1)/4

= (3N² + 5N + 2)/12 − (3N² + 6N + 3)/12 = −(N + 1)/12,

and so for 1 ≤ i < k ≤ m,

(3)   Cov_0(R_i, R_k) = Cov_0(R_1, R_2) = −(N + 1)/12.

By the way, from (2), the standard deviation of an individual rank is asymptotic to N/√12 as N → +∞, and so the correlation of two different ranks is asymptotic to −1/N. It makes sense that the covariance and correlation should be negative, because if one rank is large, another will tend to be smaller. It also makes sense that the correlation should approach 0 for N large, as the influence of one rank on another becomes smaller.

By the formula for the variance of a sum of (dependent) variables, (2), and (3),

Var_0(T_X) = m Var_0(R_1) + m(m − 1) Cov_0(R_1, R_2)
= m (N² − 1)/12 − m(m − 1)(N + 1)/12
= (mN² − m − m²N + mN − m² + m)/12
= (mN(N − m) + m(N − m))/12
= (N + 1)m(N − m)/12 = (N + 1)mn/12,

which agrees with the formula given in Rice, Third Ed., p. 438, Theorem A.

Under H_0, T_X has a distribution symmetric around its mean m(N + 1)/2, because r ↦ N + 1 − r is a one-to-one transformation of the set {1, 2, ..., N} onto itself; if each rank R_i is replaced by N + 1 − R_i, then T_X is changed to m(N + 1) − T_X. When m and n are both large, T_X becomes approximately normal with its given mean and variance. Rice (p. 441) says that the normal approximation works well when m and n are both larger than 10. He also gives tables for max(m, n) ≤ 20 (covering 3 pages). For m < 10 < 20

Under H0, if the Xi are the smaller sample, TX has a distribution symmetric around n1(N + 1)/2, and so TX has the same distribution as R′.
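As an illustration (a sketch in R with arbitrary simulated samples; the sizes m = 6, n = 9 are placeholders), T_X can be computed directly, and the null mean m(N+1)/2 and variance mn(N+1)/12 derived above can be checked by simulation:

```r
# Sketch: compute the rank-sum statistic T_X and check its null mean and
# variance by Monte Carlo.  Assumes no ties (continuous distributions).
set.seed(1)
m <- 6; n <- 9; N <- m + n

x <- rnorm(m); y <- rnorm(n)          # one data set (here H0 happens to hold)
TX <- sum(rank(c(x, y))[1:m])         # sum of the ranks of the X's
TX

TXsim <- replicate(20000, sum(rank(rnorm(N))[1:m]))  # all N obs. i.i.d.: H0
c(mean(TXsim), m * (N + 1) / 2)            # simulated vs. theoretical mean
c(var(TXsim),  m * n * (N + 1) / 12)       # simulated vs. theoretical variance
```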

3.1. The Mann–Whitney form. The statistics T_X or T_Y are as originally defined by F. Wilcoxon. "Mann–Whitney" refers to another test statistic defined as follows. (Rice also treats both forms, but his Table 8 is for the Wilcoxon rank-sum form.) Assume, as we have been, that the X_i and Y_j are independent and from continuous distributions, so that there are no ties. Let M_X := Σ_{i=1}^m Σ_{j=1}^n 1_{Y_j < X_i}, the number of pairs (i, j) with Y_j < X_i. Since, for each i, the number of Y's below the ith smallest X is its rank minus i, summing over i gives M_X = T_X − m(m + 1)/2, so the Mann–Whitney and Wilcoxon forms of the statistic differ only by a constant. Likewise M_Y = T_Y − n(n + 1)/2, and with no ties M_X + M_Y = mn.

3.2. The Mann–Whitney–Wilcoxon test in R. In R, let x = c(X1,...,Xm) be one sample and y = c(Y1,...,Yn) the other. Then wilcox.test(x,y) performs a two-sided rank-sum test of H_0, giving a p-value. The statistic "W" R computes is the Mann–Whitney statistic M_X. The p-value is the probability under H_0 that min(W, mn − W) is less than or equal to its observed value. For given x and y, wilcox.test(x,y) and wilcox.test(y,x) will give the statistics M_X and M_Y respectively, but the same p-value. If there are no ties, R computes M_X exactly but uses a normal approximation to the p-value if max(m, n) ≥ 50. The option wilcox.test(x,y,exact=TRUE) will try to compute exact p-values. Systems may run out of memory in trying this, for m and n of the order of several hundred or of "a few thousand" depending on the system. It seems one should use "exact = TRUE" if, for example, m is a single-digit number and n ≥ 50.
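A sketch of this usage (with arbitrary simulated data), also checking the pair-counting form of M_X against the W that wilcox.test reports:

```r
# Sketch: wilcox.test and the Mann-Whitney statistic by direct pair counting.
set.seed(2)
x <- rnorm(15)
y <- rnorm(20, mean = 0.4)
MX <- sum(outer(y, x, "<"))          # number of pairs with Y_j < X_i
wt <- wilcox.test(x, y)              # two-sided rank-sum test
c(W = unname(wt$statistic), MX = MX) # with no ties these should agree
wt$p.value
```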

3.3. The exact distribution of T_X. Under H_0, with F continuous, the ranks of the X_i in the combined sample are a random subset of m members of {1, 2, ..., N = m + n}. In other words, each such subset has probability 1/C(N, m) of being chosen, where C(N, m) is the binomial coefficient. Thus for each possible value k of T_X, Pr_0(T_X = k) is 1/C(N, m) times the number of subsets J ⊂ {1, ..., N} with m elements such that Σ_{j∈J} j = k. This can be used when m is small, e.g. m = 1 or 2, so that a normal approximation is not accurate, yet n is too large for Rice's table to apply.
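For small m this enumeration is easy to carry out by brute force; here is a sketch in R (the function name exact_TX is ad hoc, not from the text above):

```r
# Sketch: exact null distribution of T_X by enumerating all m-element
# subsets of {1, ..., N}.  Feasible only when choose(N, m) is modest.
exact_TX <- function(m, n) {
  N <- m + n
  sums <- combn(N, m, FUN = sum)     # rank sum for every possible subset
  table(sums) / choose(N, m)         # each subset has probability 1/choose(N, m)
}
exact_TX(2, 8)                       # e.g. m = 2, n = 8
```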

3.4. Insensitivity to outliers. The Wilcoxon–Mann–Whitney test, like other nonparametric methods, is not sensitive to outliers. If one observation is the largest, it will have rank N, and if it is made larger by an arbitrary amount, the rank, and so the test statistic, will not change. By contrast, the two-sample t-test is sensitive to outliers, besides depending on a normality assumption in which the two normal distributions have the same variance although they may have different means.

4. The Kolmogorov–Smirnov test

Here, as in the rank-sum test, we are given X_1, ..., X_m assumed i.i.d. (F) and independent of Y_1, ..., Y_n i.i.d. (G), and again want to test the hypothesis H_0 that F = G. The test statistic will have a well-defined distribution not depending on F under H_0 as long as F is continuous, so we will assume that. Unlike the Mann–Whitney–Wilcoxon test, which aims to detect location (shift) alternatives, the Kolmogorov–Smirnov test works against arbitrary alternatives F ≠ G. The Kolmogorov–Smirnov test will tend to be less powerful against location alternatives.

The test works as follows. Let F_m(x) := (1/m) Σ_{i=1}^m 1_{X_i ≤ x} and G_n(x) := (1/n) Σ_{j=1}^n 1_{Y_j ≤ x} be the empirical distribution functions of the two samples. The basic test statistic is D_{m,n} := sup_x |(F_m − G_n)(x)|. The hypothesis H_0 is rejected for large enough values of D_{m,n}, how large depending on m and n.

By the Glivenko–Cantelli theorem (proved in graduate probability, e.g. 18.175; Dudley, 2002, Theorem 11.4.2 p. 400), with probability 1, sup_x |(F_m − F)(x)| → 0 as m → ∞ and sup_x |(G_n − G)(x)| → 0 as n → ∞. If the hypothesis H_0: F = G is true, then D_{m,n} will approach 0 with probability 1 as m and n both go to infinity. Whereas, if F(x) ≠ G(x) for at least one value of x, then D_{m,n} will not approach 0. Thus, the test can detect any departure from H_0.

Tabulation of critical values of D_{m,n}, say for a few values of α such as 0.05, 0.01, and 0.001, is space-consuming because of the two variables m, n. Interpolation and extrapolation from such tables is not feasible in general, as adjoining values of m for a given n may give quite different behavior of D_{m,n}. For example, if m = n = 20, then D_{m,n} has 21 possible values 0.05j for j = 0, 1, ..., 20 (where j = 0 could only occur in case of [many!] ties, which should not happen for continuous F), whereas for m = 19 and n = 20, D_{m,n} has many more possible values, although not all numbers j/380 for j = 1, ..., 380 are actually possible.

To get a limiting distribution as m and n both go to infinity, the normalized test statistic is defined as

(4)   KS_{m,n} := √(mn/(m + n)) · D_{m,n}.

The limiting distribution is then given by:

Theorem 1 (Smirnov, 1939). If H_0 holds and F = G is continuous, then for any M > 0,

(5)   lim_{m,n→∞} Pr(KS_{m,n} ≥ M) = 2 Σ_{k=1}^∞ (−1)^{k−1} exp(−2k²M²)
      = 2 (e^{−2M²} − e^{−8M²} + e^{−18M²} − ···) < 2e^{−2M²}.

The sum in Theorem 1 converges quite fast unless M is small, so a few terms or even the first term can give a good approximation. The convergence to the limit as m, n → ∞, however, is slow and irregular, because of the effects of whether the least common multiple of m and n is large or not in relation to m and n.
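As an illustration (a sketch in R with arbitrary simulated samples), D_{m,n}, the normalized statistic KS_{m,n} of (4), and the leading term 2 exp(−2M²) of (5) can be computed directly:

```r
# Sketch: compute D_{m,n}, the normalized statistic KS_{m,n} of (4), and
# the leading term 2*exp(-2*M^2) of the Smirnov limit (5).
set.seed(3)
x <- rnorm(60); y <- rnorm(80)
m <- length(x); n <- length(y)
pts <- sort(c(x, y))                          # the sup is attained at a jump point
Dmn <- max(abs(ecdf(x)(pts) - ecdf(y)(pts)))  # sup_x |F_m(x) - G_n(x)|
KS  <- sqrt(m * n / (m + n)) * Dmn
c(Dmn = Dmn, KS = KS, first_term = 2 * exp(-2 * KS^2))
```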

4.1. The Kolmogorov–Smirnov test in R. If in R, x = c(X1,...,Xm) and y = c(Y1,...,Yn), then ks.test(x,y) performs the test and gives a p-value, with a warning in case there are any ties among the values of X_i and Y_j. According to R documentation given on two websites I found, R finds exact p-values if mn ≤ 10⁴ and there are no ties. For mn > 10⁴ it uses the asymptotic distribution given on the right side of (5) for M = KS_{m,n}. By calling ks.test(x,y,exact = TRUE) one can request exact p-values for any m and n.

In one example, I set x = runif(101) and y = runif(102,0.14,1.14). Then since mn > 10⁴, the default ks.test gives the asymptotic approximation to the p-value, which was 0.05152, whereas ks.test(x,y,exact=TRUE) gave 0.04605. So according to the exact p-value, the hypothesis F = G should be rejected at level α = 0.05, which the asymptotic approximation does not give. For other x and y generated the same way, ks.test(x,y) gave a p-value 0.1193, so F = G would not be rejected at level α = 0.1 using the asymptotic approximation, whereas ks.test(x,y,exact=TRUE) gave a p-value 0.0981 < 0.1, so F = G should be rejected at α = 0.1. For the same x and y, wilcox.test(x,y) gave a p-value 0.02065, so F = G would be rejected for the more usual α = 0.05. This illustrates that for location (shift) alternatives, which this is (U[0, 1] vs. a shift of itself to the right by 0.14), the Mann–Whitney–Wilcoxon test is more powerful than the Kolmogorov–Smirnov test.

Setting the boundary at mn = 10⁴ may be excessively cautious. On the system I use, an exact p-value for m = 2000 and n = 2001 was done without any noticeable waiting. But for m and n much larger, computation and memory use might be excessive.
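The calls in that example follow the pattern sketched below; since no random seed was recorded above, rerunning this gives p-values that differ from the ones quoted:

```r
# Sketch of the comparison described above; p-values vary with the seed.
x <- runif(101)
y <- runif(102, 0.14, 1.14)          # U[0,1] shifted right by 0.14
ks.test(x, y)$p.value                # asymptotic approximation (mn > 10^4)
ks.test(x, y, exact = TRUE)$p.value  # exact p-value
wilcox.test(x, y)$p.value            # rank-sum test on the same data
```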

4.2. The one-sample Kolmogorov test and inequalities. Before N. V. Smirnov (1939) proposed the two-sample test, Kolmogorov (1933) gave a test for the simple hypothesis H_F that observed X_1, ..., X_n are i.i.d. with a distribution function F, with the test statistic D_n := sup_x |(F_n − F)(x)|, where F_n is the empirical distribution function based on X_1, ..., X_n. Kolmogorov found that under H_F, if F is continuous, for K_n := √n sup_x |(F_n − F)(x)| and any M > 0, Pr(K_n > M) converges as n → ∞ to the right side of (5). Dvoretzky, Kiefer and Wolfowitz (1956) had shown that for some constant C < +∞,

(6)   Pr(K_n > M) ≤ C exp(−2M²)

for all n and for all M > 0. Moreover, Massart (1990) proved that one can take C = 2, which is the best possible constant. An expanded proof is given in Dudley (2014, §1.5). Note that 2 exp(−2M²) is the leading term on the right side of (5).

A question then is, under H_0: F = G, for what C if any can one replace K_n by KS_{m,n}. Simple examples with 1 ≤ m ≤ n ≤ 3 show that

(7)   P_{m,n,M} := Pr(KS_{m,n} > M) ≤ C exp(−2M²)

does not hold for C = 2. If the bound (7) holds, say with C = 2, and if for M = KS_{m,n}, 2 exp(−2M²) ≤ α, then in a conservative test, we can reject H_0: F = G at level α.

4.3. The relation of the one-sample and two-sample normalizations. We have

√(mn/(m + n)) (F_m − G_n) = √(mn/(m + n)) ((F_m − F) − (G_n − F))
= √(n/(m + n)) · √m (F_m − F) − √(m/(m + n)) · √n (G_n − F).

Under H_0: F = G, the two one-sample processes √m(F_m − F) and √n(G_n − F) are independent of each other, have expectation 0 at all x, and the suprema of their absolute values each have limiting distributions as on the right side of (5). For any fixed x, the limit distribution of each is N(0, F(x)(1 − F(x))). Since

(√(n/(m + n)))² + (√(m/(m + n)))² ≡ 1,

for a fixed x the limit distribution of √(mn/(m + n)) (F_m − G_n)(x) is also N(0, F(x)(1 − F(x))). One can also show that covariances of H(x_1) and H(x_2) for

5. The Wilcoxon signed-rank test

Suppose we’ve observed pairs (Xj,Yj) which are independent be- tween pairs, not necessarily identically distributed. The hypothesis H0 to be tested is that for each j, the distribution of (Xj,Yj) is the same as that of (Yj,Xj). Take the differences Dj := Xj Yj. Then H implies − 0 that for each j, Dj has a distribution symmetric around 0. Find Dj for j =1,...,n and their order statistics, which will be called | |

0 ≤ |D|_(1) ≤ |D|_(2) ≤ ··· ≤ |D|_(n).

Suppose that each D_j has a continuous distribution, so that there are no ties, 0 < |D|_(1) < |D|_(2) < ··· < |D|_(n). Define ranks by R_j := k if |D_j| = |D|_(k). Let W_+ := Σ_{j=1}^n R_j 1_{D_j > 0}, the sum of the ranks of |D_j| for those j with D_j > 0. Then W_+ is called the Wilcoxon signed-rank statistic. Under H_0, 1_{D_j > 0} = 1 or 0 with probability 1/2 each, independently of each other and of the R_j; in other words these are Bernoulli(1/2) variables. Since the ranks are the integers 1, 2, ..., n in some order, by the independence, W_+ is equal in distribution to Σ_{j=1}^n j B_j where the B_j are i.i.d. Bernoulli(1/2) variables. It follows that under H_0,

E_0 W_+ = Σ_{j=1}^n j/2 = n(n + 1)/4,

and its variance is

Var_0(W_+) = Σ_{j=1}^n Var(j B_j) = Σ_{j=1}^n j²/4 = n(n + 1)(2n + 1)/24.

It’s easily seen that under H0 the distribution of W+ is symmetric around its mean m = n(n+1)/4, because 1 Bj are also i.i.d. Bernoulli − SOMENONPARAMETRICTESTS 11

n (1/2) and independent of the Rj, and j=1 j(1 Bj)= n(n + 1)/2 n n − − jBj =2m jBj. P j=1 − j=1 PFor n large enough,P the distribution of W+ under H0 is approximately normal with the given mean and variance. Rice, p. 449, says n 20 is sufficient for the normal approximation. ≥ The test statistic W+ is used in a two-sided way: H0 is rejected if W+ is either too large or too small, relative to its distribution under H0.
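As an illustration (a sketch in R with made-up paired data; the sample size n = 12 is arbitrary), W_+ can be computed directly, and the null mean n(n+1)/4 and variance n(n+1)(2n+1)/24 checked by simulation:

```r
# Sketch: compute W_+ and check its null mean and variance by Monte Carlo.
set.seed(4)
n <- 12
x <- rnorm(n); y <- x + rnorm(n)        # paired data; here H0 holds
d <- x - y
Wplus <- sum(rank(abs(d))[d > 0])       # Wilcoxon signed-rank statistic
Wplus
c(n * (n + 1) / 4, n * (n + 1) * (2 * n + 1) / 24)   # theoretical mean, variance
Wsim <- replicate(20000, { d <- rnorm(n); sum(rank(abs(d))[d > 0]) })
c(mean(Wsim), var(Wsim))                # simulated mean and variance under H0
```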

5.1. The signed-rank test in R. R does the signed-rank test, setting x = c(X1,...,Xn) and y = c(Y1,...,Yn) (which must of course be two vectors of the same length), via wilcox.test(x,y,paired=TRUE). (The default wilcox.test(x,y) with paired = FALSE does the Mann–Whitney–Wilcoxon test.) The normal approximation is used for n ≥ 50.

5.2. The exact distribution of W_+ under H_0. For each set J ⊂ {1, ..., n}, the probability under H_0 that B_j = 1 if and only if j ∈ J is 1/2^n. For any k = 0, 1, ..., n(n + 1)/2, the probability that W_+ = k is 1/2^n times the number of subsets J with Σ_{j∈J} j = k. For example, if n ≥ 4, so n(n + 1)/2 ≥ 10, there are two sets J over which the sum is W_+ = 4, namely {4} and {1, 3}, so P_0(W_+ = 4) = 2/2^n. There is one set, the empty set, over which the sum is 0, so P_0(W_+ = 0) = 1/2^n.

For n ≤ 25, Rice, Table 9 p. A24, gives a table (the last of the tables) for the distribution of W_+. He gives W_α = k such that P_0(W_+ ≤ k) is "closest to" α/2 (which is not exactly what it is in all cases). One would then reject H_0, according to Rice, if W := min(W_+, n(n + 1)/2 − W_+) ≤ W_α. The actual size of the test might then be larger than α. For an example (mentioned in the header of the table), if n = 8, the table gives W_{0.05} = 4, where P_0(W_+ ≤ 4) = 7/256 ≈ 0.0273, which is larger than but closer to 0.025 than is P_0(W_+ ≤ 3) = 5/256 ≈ 0.01953. On the other hand, if n = 6, the table gives W_{0.05} = 0, where P_0(W_+ ≤ 1) = 1/32 = 0.03125, which is larger than but closer to 0.025 than is P_0(W_+ = 0) = 1/64 = 0.015625. Evidently Rice felt that in the latter case the p-value for W_+ ≤ 1 is too much larger than α/2.

Actual sizes larger than a nominal value α can also occur when p-values are computed only by some approximation. When, as here, they are computed exactly, it seems to me unusual to declare an outcome significant at level α when one knows it is not.
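For small n this exact distribution is easy to compute by brute force; here is a sketch in R (the function name is ad hoc) that reproduces the values quoted above:

```r
# Sketch: exact P_0(W_+ <= k) for small n by enumerating all 2^n sign patterns.
exact_Wplus_cdf <- function(n, k) {
  B <- as.matrix(expand.grid(rep(list(0:1), n)))  # all 2^n choices of B_1,...,B_n
  mean(B %*% (1:n) <= k)                          # fraction with sum_j j*B_j <= k
}
exact_Wplus_cdf(8, 4)   # 7/256 = 0.02734375
exact_Wplus_cdf(8, 3)   # 5/256 = 0.01953125
exact_Wplus_cdf(6, 1)   # 1/32  = 0.03125
```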

5.3. Applications of the signed-rank test. In one kind of study, there are pairs of individuals who have been selected to be alike in some respects, such as age, gender, and health status with respect to a condition for which there is an experimental treatment. One individual from each pair, chosen at random, would get the treatment, and a measurement for the treated individual would be Y_j for the jth pair. The other individual in the pair would not get the treatment (if it's a medication they might get a placebo), and X_j would be a measurement on the untreated member of the pair. In another, similar kind of study, there would be just n individuals, and X_j and Y_j would be measurements on the jth individual before and after getting the treatment. In either case H_0 would be equivalent to the hypothesis that the treatment made no difference. In such studies one may be interested in one-sided alternatives: W_+ significantly larger than n(n + 1)/4 would indicate that the X_j tend to be larger than the Y_j, and W_+ significantly less than n(n + 1)/4 would indicate that the X_j tend to be smaller than the Y_j, which could indicate either lack of safety or effectiveness of the treatment, depending on what the measurement represents.

6. Normal approximations corrected for continuity

Let X be an integer-valued random variable whose distribution is approximately normal, such as a value of a rank-sum or signed-rank statistic under H_0 for large enough m, n or n respectively, or a Poisson distribution with large λ. Let X have mean µ and standard deviation σ, so that (X − µ)/σ has approximately a N(0, 1) distribution. Let k be an integer such that Pr(X = k) > 0. The simple normal approximation to Pr(X ≤ k) would be Φ((k − µ)/σ). But since X is integer-valued, Pr(X ≤ k) = Pr(X ≤ u) for k ≤ u < k + 1. What the correction for continuity does is to take u as the midpoint of the given interval, i.e. u = k + 1/2, and so to approximate Pr(X ≤ k) by

Pr(X ≤ k) ≈ Φ((k + 1/2 − µ)/σ).
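As a small numerical illustration (a sketch; the choice of a Poisson(30) variable and of k = 25 is arbitrary), the corrected approximation can be compared with the exact probability and the uncorrected one:

```r
# Sketch: continuity-corrected normal approximation to Pr(X <= k)
# for X ~ Poisson(lambda), versus the exact and uncorrected values.
lambda <- 30; k <- 25
mu <- lambda; sigma <- sqrt(lambda)
c(exact       = ppois(k, lambda),
  uncorrected = pnorm((k - mu) / sigma),
  corrected   = pnorm((k + 0.5 - mu) / sigma))
```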

Corrections for continuity are widely adopted, as it's believed that they usually improve the approximation, although they don't in all cases. In particular, R uses corrections for continuity for normal approximations to p-values for wilcox.test, in both the unpaired (rank-sum) and paired (signed-rank) cases, for m and/or n above some values. When R does so, it tells you so in the output; "with continuity correction" also alerts you that a normal approximation is being used.

7. Ties Although we’ve been assuming that all observations are distinct, ties often occur in real data, e.g. in large data sets because of rounding. Then “tied ranks” can be assigned as follows. Let the order statistics be

X_(1) ≤ ··· ≤ X_(i−1) < X_(i) = ··· = X_(i+k−1) < X_(i+k) ≤ ··· ≤ X_(N), where k ≥ 2 of the observations are tied. Each of these tied observations is then given the "midrank" i + (k − 1)/2, the average of the ranks i, i + 1, ..., i + k − 1 that they jointly occupy. In the signed-rank test, a difference D_j = 0 is problematic, because it is not clear whether to count it as satisfying D_j > 0 or not. Similarly, in the Mann–Whitney–Wilcoxon rank-sum test, a tie X_i = Y_j is problematic because arbitrarily small changes in either observation can switch the Mann–Whitney statistic term 1_{Y_j < X_i} between 0 and 1.
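In R, the default tie handling of rank() produces exactly these midranks; a small sketch:

```r
# Sketch: rank() with ties.method = "average" (the default) assigns midranks.
z <- c(1.2, 3.4, 3.4, 3.4, 5.6, 5.6)
rank(z)                            # 1.0 3.0 3.0 3.0 5.5 5.5
rank(z, ties.method = "average")   # same, with the method stated explicitly
```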

8. Two-sample Dvoretzky–Kiefer–Wolfowitz inequalities

This is an appendix, not part of the course material. Wei and Dudley (2011, 2012) recently showed that for m = n, (7) with C = 2 fails for n ≤ 457 but holds for all n ≥ 458. For n ≥ 4, the smallest n for which H_0 can be rejected, (7) holds for C = 2.16863. They also showed for m ≠ n that (7) holds with C = 2 for n ≥ 4 and 1 ≤ m

9. Historical Notes

Wilcoxon (1945) first defined the rank-sum test, but only for m = n and without developing it very much. Mann and Whitney (1947) defined a test using their form of the statistic. They noted that it was equivalent to Wilcoxon's rank-sum statistic up to adding a constant.

They evaluated the mean and variance of their statistic under the null hypothesis H_0 and actually proved asymptotic normality for m and n both large, by way of the even-order moments E_0((M_X − E_0 M_X)^{2k}) around the mean (by the symmetry, the odd-order moments are 0). Kolmogorov in 1933 proposed the one-sample test of the hypothesis H_F that X_1, ..., X_n are i.i.d. (F). Kolmogorov, a leading Russian probabilist, had earlier published works on probability in German. In 1933 there were few statistics journals in the world, and the editor of the Italian actuarial journal, Cantelli, was receptive to papers from Russia. Wilcoxon (1945) first defined the signed-rank test, in the second half of the short paper where he defined the two-sample rank-sum test.

REFERENCES

Dudley, R. M. (2002), Real Analysis and Probability, 2d ed., Cambridge University Press.

Dudley, R. M. (2014), Uniform Central Limit Theorems, 2d ed., Cambridge University Press.

Dvoretzky, A., Kiefer, J., and Wolfowitz, J. (1956), Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator, Ann. Math. Statist. 27, 642–669.

Kolmogorov, A. N. (1933), Sulla determinazione empirica di una legge di distribuzione, Giorn. Ist. Ital. Attuari 4, 83–91.

Mann, H. B., and Whitney, D. R. (1947), On a test of whether one of two random variables is stochastically larger than another, Ann. Math. Statist. 18, 50–60.

Massart, P. (1990), The tight constant in the Dvoretzky–Kiefer–Wolfowitz inequality, Ann. Probab. 18, 1269–1283.

Smirnov, N. V. (1939), An estimate of the difference between empirical curves of a distribution in two independent samples, Bull. Mosk. Gos. Univ. 2, 3–14 (in Russian).

Wei, Fan, and Dudley, R. M. (2011), Dvoretzky–Kiefer–Wolfowitz inequalities for the two-sample case, arXiv:1107.5356v2 [math.ST], 11 Aug. 2011.

Wei, Fan, and Dudley, R. M. (2012), Two-sample Dvoretzky–Kiefer–Wolfowitz inequalities, Statist. Probab. Letters 82, 636–644.

Wilcoxon, F. (1945), Individual comparisons by ranking methods, Biometrics Bulletin 1, no. 6, pp. 80–83.