5 Introduction to the Theory of Order Statistics and Rank Statistics

• This section will contain a summary of important definitions and theorems that will be useful for understanding the theory of order and rank statistics. In particular, results will be presented for linear rank statistics.

• Many nonparametric tests are based on test statistics that are linear rank statistics.
  - For one sample: the Wilcoxon Signed-Rank Test is based on a linear rank statistic.
  - For two samples: the Mann-Whitney-Wilcoxon Test, the Median Test, the Ansari-Bradley Test, and the Siegel-Tukey Test are based on linear rank statistics.

• Most of the information in this section can be found in Randles and Wolfe (1979).

5.1 Order Statistics

• Let X_1, X_2, ..., X_n be a random sample of continuous random variables having cdf F(x) and pdf f(x).

• Let X_(i) be the i-th smallest random variable (i = 1, 2, ..., n).

• X_(1), X_(2), ..., X_(n) are referred to as the order statistics for X_1, X_2, ..., X_n. By definition, X_(1) < X_(2) < ··· < X_(n).

Theorem 5.1: Let X_(1) < X_(2) < ··· < X_(n) be the order statistics for a random sample from a distribution with cdf F(x) and pdf f(x). The joint density for the order statistics is

  g(x_(1), x_(2), ..., x_(n)) = n! ∏_{i=1}^{n} f(x_(i))   for −∞ < x_(1) < x_(2) < ··· < x_(n) < ∞   (16)
                              = 0                          otherwise

Theorem 5.2: The marginal density for the j-th order statistic X_(j) (j = 1, 2, ..., n) is

  g_j(t) = [n! / ((j − 1)!(n − j)!)] [F(t)]^{j−1} [1 − F(t)]^{n−j} f(t),   −∞ < t < ∞.

• For a random variable X with cdf F(x), the inverse distribution function F^{−1}(·) is defined as

  F^{−1}(y) = inf{x : F(x) ≥ y},   0 < y < 1.

• If F(x) is strictly increasing between 0 and 1, then there is only one x such that F(x) = y. In this case, F^{−1}(y) = x.

Theorem 5.3 (Probability Integral Transformation): Let X be a continuous random variable with distribution function F(x). Then the random variable Y = F(X) is uniformly distributed on (0, 1).
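The generalized inverse and Theorem 5.3 together give the familiar inverse-transform method: if U ~ U(0, 1), then X = F^{−1}(U) has cdf F, and applying F to X recovers a uniform variate. A minimal sketch of this round trip for the Exp(4) distribution, where F(x) = 1 − e^{−4x} has the closed-form inverse F^{−1}(y) = −log(1 − y)/4 (the notes' own code is R; Python and the helper names `F`, `F_inv` are used here purely for illustration):

```python
import math
import random

def F(x, rate=4.0):
    """Cdf of the Exp(rate) distribution: F(x) = 1 - exp(-rate*x)."""
    return 1.0 - math.exp(-rate * x)

def F_inv(y, rate=4.0):
    """Closed-form inverse cdf: F^{-1}(y) = -log(1 - y)/rate."""
    return -math.log(1.0 - y) / rate

random.seed(1)
u = [random.random() for _ in range(5)]   # U(0,1) draws
x = [F_inv(ui) for ui in u]               # Exp(4) draws via inverse transform
back = [F(xi) for xi in x]                # Theorem 5.3: F(X) is U(0,1)

# Applying F to X = F^{-1}(U) recovers the original uniforms (up to rounding)
print(all(abs(ui - bi) < 1e-12 for ui, bi in zip(u, back)))  # True
```

Because F here is continuous and strictly increasing, F^{−1} is the ordinary inverse and the recovered values match the original uniforms exactly up to floating-point error.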
• Let X_(1) < X_(2) < ··· < X_(n) be the order statistics for a random sample from a continuous distribution. Application of Theorem 5.3 implies that F(X_(1)) < F(X_(2)) < ··· < F(X_(n)) are distributed as the order statistics from a uniform distribution on (0, 1).

• Let V_j = F(X_(j)) for j = 1, 2, ..., n. Then, by Theorem 5.2, the marginal density for each V_j has the form

  g_j(t) = [n! / ((j − 1)!(n − j)!)] t^{j−1} (1 − t)^{n−j},   0 < t < 1,

because F(t) = t and f(t) = 1 for a uniform distribution on (0, 1).

• Thus, V_j has a beta distribution with parameters α = j and β = n − j + 1. Therefore, the moments of V_j are

  E(V_j^r) = [n! / (j − 1)!] · [Γ(r + j) / Γ(n + r + 1)],

where Γ(k) = (k − 1)! for integer k.

• Thus, when V_j is the j-th order statistic from a uniform distribution,

  E(V_j) = j / (n + 1)   and   Var(V_j) = j(n − j + 1) / [(n + 1)^2 (n + 2)].

Simulation to Demonstrate Theorem 5.3 (Probability Integral Transformation)

Case 1: N(0, 1) Distribution

1. Generate a random sample (x_1, x_2, ..., x_5000) of 5000 values from a normal N(0, 1) distribution.
2. Determine the 5000 cdf values F(x_i).
3. Plot the histogram and empirical cdf of the original N(0, 1) sample. Note how they represent a sample from a standard normal distribution.
4. Plot the histogram and empirical cdf of the F(x_i) values. Note that they represent a sample from a uniform U(0, 1) distribution (as supported by Theorem 5.3).

Case 2: Exp(4) Distribution

1. Generate a random sample (x_1, x_2, ..., x_5000) of 5000 values from an exponential Exp(4) distribution.
2. Determine the 5000 cdf values F(x_i).
3. Plot the histogram and empirical cdf of the original Exp(4) sample. Note how they represent a sample from an exponential Exp(4) distribution.
4. Plot the histogram and empirical cdf of the F(x_i) values. Note that they represent a sample from a uniform U(0, 1) distribution (as supported by Theorem 5.3).
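The mean and variance formulas for uniform order statistics can be checked by simulation. A minimal sketch with n = 10 and j = 3, so that E(V_3) = 3/11 ≈ 0.273 and Var(V_3) = 24/1452 ≈ 0.0165 (the notes' own code is R; this standard-library Python version is purely illustrative):

```python
import random
import statistics

random.seed(42)
n, j, reps = 10, 3, 20000

# j-th order statistic of n U(0,1) draws, repeated many times
v = [sorted(random.random() for _ in range(n))[j - 1] for _ in range(reps)]

mean_theory = j / (n + 1)                                # 3/11
var_theory = j * (n - j + 1) / ((n + 1) ** 2 * (n + 2))  # 24/1452

print(abs(statistics.mean(v) - mean_theory) < 0.01)      # True
print(abs(statistics.variance(v) - var_theory) < 0.005)  # True
```

With 20,000 replications the Monte Carlo standard error of the sample mean is about sqrt(0.0165/20000) ≈ 0.001, so the agreement with j/(n + 1) is well inside the tolerance shown.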
[Figure: histograms and empirical cdfs of the N(0,1) and Exp(4) samples and of their cdf-transformed values F(x_i); in both cases the transformed values appear uniform on (0, 1).]

R Code for Simulation of Theorem 5.3 (Probability Integral Transformation)

n = 5000   # size of random sample

# CASE 1: Random Samples from N(0,1) Distribution
x1 <- rnorm(n, 0, 1)
x1[1:10]          # view first 10 values
Fx1 <- pnorm(x1)
Fx1[1:10]
windows()
par(mfrow = c(2, 2))
hist(x1, main = "Histogram of N(0,1) Sample")
hist(Fx1, main = "Histogram of CDF of N(0,1) Sample")
plot(ecdf(x1), main = "ECDF of N(0,1) Sample")
plot(ecdf(Fx1), main = "ECDF(CDF of N(0,1) Sample)")

# CASE 2: Random Samples from Exponential(4) Distribution
x2 <- rexp(n, 4)
x2[1:10]          # view first 10 values
Fx2 <- pexp(x2, 4)
Fx2[1:10]
windows()
par(mfrow = c(2, 2))
hist(x2, main = "Histogram of Exp(4) Sample")
hist(Fx2, main = "Histogram of CDF of Exp(4) Sample")
plot(ecdf(x2), main = "ECDF of Exp(4) Sample")
plot(ecdf(Fx2), main = "ECDF(CDF of Exp(4) Sample)")

5.2 Equal-in-Distribution Results

• Two random variables S and T are equal in distribution if S and T have the same cdf. To denote equal in distribution, we write S =d T.

Theorem 5.4: A random variable X has a distribution that is symmetric about some number μ if and only if (X − μ) =d (μ − X).

Theorem 5.5: Let X_1, X_2, ..., X_n be independent and identically distributed (i.i.d.) random variables. Let (α_1, α_2, ..., α_n) denote any permutation of the integers (1, 2, ..., n). Then (X_1, X_2, ..., X_n) =d (X_{α_1}, X_{α_2}, ..., X_{α_n}).
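Theorem 5.4 can be checked exactly for a finite symmetric distribution by comparing the two induced distributions directly. A small sketch using X uniform on {1, 2, 3, 4, 5}, which is symmetric about μ = 3 (illustrative Python, not from the notes):

```python
from collections import Counter

# X uniform on {1, 2, 3, 4, 5}, symmetric about mu = 3
support = [1, 2, 3, 4, 5]
mu = 3

dist_left = Counter(x - mu for x in support)   # distribution of X - mu
dist_right = Counter(mu - x for x in support)  # distribution of mu - X

# Both place equal mass on {-2, -1, 0, 1, 2}, so (X - mu) =d (mu - X)
print(dist_left == dist_right)  # True
```

The same comparison with an asymmetric support, e.g. {1, 2, 3, 4, 6}, would fail, since no μ makes the two induced distributions coincide.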
• A set of random variables X_1, X_2, ..., X_n is exchangeable if for every permutation (α_1, α_2, ..., α_n) of the integers 1, 2, ..., n,

  (X_1, X_2, ..., X_n) =d (X_{α_1}, X_{α_2}, ..., X_{α_n}).

• If X_1, X_2, ..., X_n are i.i.d. random variables, then the set X_1, X_2, ..., X_n is exchangeable.

• The statistic t(·) is
  1. a translation statistic if t(x_1 + k, x_2 + k, ..., x_n + k) = t(x_1, x_2, ..., x_n) + k
  2. a translation-invariant statistic if t(x_1 + k, x_2 + k, ..., x_n + k) = t(x_1, x_2, ..., x_n)
  for every k and x_1, x_2, ..., x_n.

5.3 Ranking Statistics

• Let Z_1, Z_2, ..., Z_n be a random sample from a continuous distribution with cdf F(z), and let Z_(1) < Z_(2) < ··· < Z_(n) be the corresponding order statistics.

• Z_i has rank R_i among Z_1, Z_2, ..., Z_n if Z_i = Z_(R_i), assuming the R_i-th order statistic is uniquely defined.

• By "uniquely defined" we are assuming that ties are not possible. That is, Z_(i) ≠ Z_(j) for all i ≠ j.

• Let R = {r : r is a permutation of the integers (1, 2, ..., n)}. That is, R is the set of all permutations of the integers (1, 2, ..., n).

Theorem 5.6: Let R = (R_1, R_2, ..., R_n) be the vector of ranks, where R_i is the rank of Z_i among Z_1, Z_2, ..., Z_n. Then R is uniformly distributed over R. That is, P(R = r) = 1/n! for each permutation r in R.

Theorem 5.7: Let Z_1, Z_2, ..., Z_n be a random sample from a continuous distribution, and let R be the corresponding vector of ranks, where R_i is the rank of Z_i for i = 1, 2, ..., n. Then

  P[R_i = r] = 1/n   for r = 1, 2, ..., n
             = 0     otherwise

and, for i ≠ j,

  P[R_i = r, R_j = s] = 1 / [n(n − 1)]   for r ≠ s, r, s = 1, 2, ..., n
                      = 0                 otherwise.

Corollary 5.8: Let R be the vector of ranks corresponding to a random sample from a continuous distribution. Then, for i = 1, 2, ..., n,

  E[R_i] = (n + 1)/2   and   Var[R_i] = (n + 1)(n − 1)/12,

and

  Cov[R_i, R_j] = −(n + 1)/12   for i ≠ j.

• Let V_1, V_2, ..., V_n be random variables with joint distribution function D, where D is a member of some collection A of possible joint distributions.
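Because Theorem 5.6 says the rank vector is uniform over all n! permutations, the moments in Corollary 5.8 can be verified exactly by enumeration rather than simulation. A sketch for n = 4 using exact rational arithmetic (illustrative Python, not from the notes):

```python
from itertools import permutations
from fractions import Fraction

n = 4
perms = list(permutations(range(1, n + 1)))  # all n! equally likely rank vectors
N = Fraction(len(perms))                     # n! = 24

# Exact moments of R_1 and Cov(R_1, R_2) under the uniform distribution on perms
e_r1 = sum(Fraction(p[0]) for p in perms) / N
var_r1 = sum(Fraction(p[0]) ** 2 for p in perms) / N - e_r1 ** 2
cov_r12 = sum(Fraction(p[0] * p[1]) for p in perms) / N - e_r1 ** 2

print(e_r1 == Fraction(n + 1, 2))                  # True: (n+1)/2 = 5/2
print(var_r1 == Fraction((n + 1) * (n - 1), 12))   # True: (n+1)(n-1)/12 = 5/4
print(cov_r12 == Fraction(-(n + 1), 12))           # True: -(n+1)/12 = -5/12
```

By symmetry (exchangeability of the ranks), the same values hold for every R_i and every pair (R_i, R_j) with i ≠ j, which is why checking R_1 and (R_1, R_2) suffices.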