Program Your Own Resampling Statistics
Appendix 1

Whether you program from scratch in BASIC, C, or FORTRAN, or use a macro language like that provided by Stata, S-Plus, or SAS, you'll follow the same four basic steps:

1. Read in the observations.
2. Compute the test statistic for these observations.
3. Resample repeatedly.
4. Compare the test statistic with its resampling distribution.

Begin by reading in the observations and, even if they form an array, store them in the form of a linear vector for ease in resampling. At the same time, read and store the sample sizes so the samples can be recreated from the vector. For example, if we have two samples 1, 2, 3, 4 and 1, 3, 5, 6, 7, we would store the observations in one vector X = (1 2 3 4 1 3 5 6 7) and the sample sizes in another, (4 5), so as to be able to mark where the first sample ends and the next begins.

Compute the value of the test statistic for the data you just read in. If you are deriving a permutation statistic, several shortcuts can help you here. Consider the well-known t-statistic for comparing the means of two samples:

$$t = \frac{\bar{X} - \bar{Y}}{\sqrt{s_{\bar{X}}^2 + s_{\bar{Y}}^2}}.$$

The denominator is the same for all permutations, so don't waste time computing it; instead use the statistic $t' = \bar{X} - \bar{Y}$. A little algebra shows that

$$t' = \bar{X} - \bar{Y} = \left(\frac{1}{n_X} + \frac{1}{n_Y}\right)\sum X_i - \frac{1}{n_Y}\left(\sum X_i + \sum Y_i\right).$$

But the sum $(\sum X_i + \sum Y_i)$ of all the observations is the same for all permutations, as are the sample sizes $n_X$ and $n_Y$; eliminating these terms (why do extra work if you don't have to?) leaves the sum of the observations in the first sample, a sum that is readily computed. Whatever you choose as your test statistic, eliminate terms that are the same for all permutations before you begin.

Four Steps to a Resampling Statistic

1. Get the data.
2. Compute the statistic S_0 for the original observations.
3. Loop:
   a. Resample.
   b. Compute the statistic S for the resample.
   c. Determine whether S > S_0.
4. State the p-value.

How we resample will depend on whether we use the bootstrap or a permutation test. For the bootstrap, we repeatedly sample with replacement from the original sample; for the permutation test, we sample without replacement. Some examples of program code are given in a sidebar. More extensive examples entailing corrections for bias and acceleration are given in Appendix 3.

Computing a Bootstrap Estimate

C++ Code

#include <stdlib.h>

get_data();                  // put the variable of interest in the first n
                             // elements of the array X[]
randomize();                 // initialize the random number generator
for (i = 0; i < 100; i++) {
    for (j = 0; j < n; j++)
        Y[j] = X[random(n)];          // draw n observations with replacement
    Z[i] = compute_statistic(Y);      // compute the statistic for the array Y
                                      // and store it in Z
}
compute_stats(Z);            // summarize the bootstrap distribution in Z

GAUSS

n = rows(Y);
U = rnds(n,1,seed);          /* seed is an integer */
I = trunc(n*U + ones(n,1));
Ystar = Y[I,.];

(This routine only generates a single resample.)

SAS Code

PROC IML;
n = nrow(Y);
U = ranuni(J(n,1,seed));     /* seed is an integer */
I = int(n*U + J(n,1,1));
Ystar = Y[I,];

S-Plus Code

"bootstrap" <- function(x, nboot, statistic, ...){
    data <- matrix(sample(x, size=length(x)*nboot, replace=T), nrow=nboot)
    return(apply(data, 1, statistic, ...))
}
# where statistic is an S-Plus function which calculates the
# statistic of interest.

Stata Code

use data, clear
bstrap median, args(height) reps(100)

where median has been predefined as follows:

program define median
    if "`1'" == "?" {
        global S_1 "median"
        exit
    }
    summarize `2', detail
    post `1' _result(10)
end
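A Self-Contained Bootstrap Example in C++

The C++ fragment above relies on user-written helpers (get_data, compute_statistic, compute_stats) and the older rand interface, so it will not compile on its own. The sketch below shows one way the same loop might look as a complete program; the sample data, the choice of the mean as the statistic, the 1,000 resamples, and the use of the C++ <random> facilities are illustrative assumptions, not part of the original code.

// Bootstrap sketch: resample with replacement and record the statistic each time.
#include <algorithm>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

double mean(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
}

int main() {
    // Hypothetical data; in practice, read the observations into one linear vector.
    std::vector<double> X = {1, 2, 3, 4, 1, 3, 5, 6, 7};
    const int B = 1000;                        // number of bootstrap resamples
    std::mt19937 gen(12345);                   // seeded random number generator
    std::uniform_int_distribution<std::size_t> pick(0, X.size() - 1);

    std::vector<double> Z(B);                  // bootstrap distribution of the statistic
    std::vector<double> Y(X.size());           // workspace for each resample
    for (int b = 0; b < B; ++b) {
        for (std::size_t j = 0; j < X.size(); ++j)
            Y[j] = X[pick(gen)];               // sample with replacement
        Z[b] = mean(Y);                        // compute the statistic for the resample
    }

    std::sort(Z.begin(), Z.end());             // crude percentile summary of Z
    std::cout << "mean of the bootstrap means: " << mean(Z) << "\n";
    std::cout << "5th and 95th percentiles: "
              << Z[B / 20] << ", " << Z[B - B / 20 - 1] << "\n";
    return 0;
}

The percentile summary printed at the end is only meant to show how the stored distribution Z is typically used; corrections for bias and acceleration are deferred to Appendix 3, as noted above.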
IMPORTANCE SAMPLING

We may sample so as to give equal probability to each of the observations, or we may use importance sampling, in which we give greater weight to some of the observations and less weight to others. The idea behind importance sampling is to reduce the variance of the estimate and, thus, the number of resamples necessary to achieve a given level of confidence.

Suppose we are testing the hypothesis that the population mean is 0 and our original sample contains the observations -2, -1, -0.5, 0, 2, 3, 3.5, 4, 7, and 8. Resamples containing 7 and 8 are much more likely to have large means, so instead of drawing bootstrap samples such that every observation has the same probability 1/10 of being included, we'll weight the sample so that larger values are more likely to be drawn, selecting -2 with probability 1/55, -1 with probability 2/55, and so forth, with the probability of selecting 8 being 10/55.¹

Let $I(S^* > 0) = 1$ if the mean of the bootstrap sample is greater than 0, and 0 otherwise. The standard estimate of the proportion of bootstrap samples whose mean is greater than zero is given by

$$\frac{1}{B}\sum_{b=1}^{B} I(S_b^* > 0).$$

Because we have selected the elements of our bootstrap samples with unequal probability, we have to use a weighted estimate of the proportion,

$$\frac{1}{B}\sum_{b=1}^{B} I(S_b^* > 0)\prod_{i=1}^{10}\left(\frac{1/10}{\pi_i}\right)^{t_{ib}},$$

where $\pi_i$ is the probability of selecting the $i$th element of the original sample and $t_{ib}$ is the number of times the $i$th element appears in the $b$th bootstrap sample. (A short self-contained sketch of this weighted estimate appears after the sidebar below.) Efron and Tibshirani [1993, p. 355] found a seven-fold reduction in variance using importance sampling with samples of size 10. Importance sampling for bootstrap tail probabilities is discussed in Johns [1988], Hinkley and Shi [1989], and Do and Hall [1991]. Mehta, Patel, and Senchaudhuri [1988] considered importance sampling for permutation tests.

REARRANGEMENTS

For permutation tests, three alternative approaches to resampling suggest themselves:

• Exhaustive: examine all possible rearrangements.
• Focus on the tails: examine only samples more extreme than the original.
• Monte Carlo: examine a large random sample of rearrangements.

¹ 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 = 55.

Estimating Permutation Test Significance Levels

C++ Code for One-Sided Test:

#include <cstdlib>
#include <iostream>

int NMonte;          // number of Monte Carlo simulations
float stat_orig;     // value of the test statistic for the unpermuted observations
float *X;            // vector containing the data
int *n;              // vector containing the cell/sample sizes
int N;               // total number of observations

int main()
{
    int i, s, k, seed, fcnt = 0;
    float temp;

    get_data(X, n);                  // pack the data into X and the sample sizes into n
    stat_orig = statistic(X, n);     // compute the statistic for the original sample
    srand(seed);
    for (i = 0; i < NMonte; i++) {
        for (s = N; s > n[0]; s--) { // rearrange the slots of the second sample
            k = choose(s);           // choose a random integer from 0 to s-1
            temp = X[k];             // put the one you chose at the end of the vector
            X[k] = X[s-1];
            X[s-1] = temp;
        }
        if (statistic(X, n) >= stat_orig)
            fcnt++;                  // count only if the value of the statistic for the
                                     // rearranged sample is as or more extreme than its
                                     // value for the original sample
    }
    std::cout << "alpha = " << (float)fcnt / NMonte << "\n";
    return 0;
}

void get_data(float *X, int *n)
{
    // This user-written procedure gets all the data and packs it into a single
    // long linear vector X. The vector n is packed with the sample sizes.
}

float statistic(float *X, int *n)
{
    // This user-written procedure computes and returns the test statistic.
}
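Importance Sampling in C++

The weighted estimate above is easy to get wrong in code, so here is a minimal self-contained sketch of it for the ten-observation example, written in the same C++ style as the other sidebars. The selection probabilities π_i = i/55 for the ith smallest observation follow the weighting described above; the number of resamples and the use of std::discrete_distribution are assumptions made for the sketch.

// Importance-sampled bootstrap estimate of the proportion of resamples with mean > 0.
#include <iostream>
#include <random>
#include <vector>

int main() {
    std::vector<double> x = {-2, -1, -0.5, 0, 2, 3, 3.5, 4, 7, 8};
    const int n = static_cast<int>(x.size());
    std::vector<double> pi(n);
    for (int i = 0; i < n; ++i)
        pi[i] = (i + 1) / 55.0;                 // 1/55, 2/55, ..., 10/55

    const int B = 10000;                        // number of bootstrap resamples
    std::mt19937 gen(12345);
    std::discrete_distribution<int> pick(pi.begin(), pi.end());  // weighted selection

    double estimate = 0.0;
    for (int b = 0; b < B; ++b) {
        double sum = 0.0, weight = 1.0;
        for (int j = 0; j < n; ++j) {
            int i = pick(gen);                  // draw index i with probability pi[i]
            sum += x[i];
            weight *= (1.0 / n) / pi[i];        // accumulates (1/10)^t_ib / pi_i^t_ib
        }
        if (sum / n > 0)
            estimate += weight;                 // I(S* > 0) times the importance weight
    }
    estimate /= B;
    std::cout << "weighted estimate of the proportion with mean > 0: " << estimate << "\n";
    return 0;
}

With equal selection probabilities the weight collapses to 1 and the formula reduces to the ordinary proportion, which is a quick sanity check on the code.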
The first two of these alternatives are highly application-specific. Zimmerman [1985a, b] provides algorithms for exhaustive enumeration of permutations. The computationally efficient algorithms developed in Mehta and Patel [1983] and Mehta, Patel, and Gray [1985] underlie the StatXact results for categorical and ordered data reported in Chapter 7. Algorithms for exact inference in logistic regression are provided in Hirji, Mehta, and Patel [1988].

An accompanying sidebar provides the C++ code for estimating permutation test significance levels via a Monte Carlo. Unlike the bootstrap, permutation tests require that we select elements without replacement. The trick (see sidebar) is to stick each selected observation at the end of the vector, where it will not be selected a second time. Appendix 2 contains several specific applications of this programming approach.

Computing a Single Random Rearrangement

With SAS:

PROC IML;
n = nrow(Y);
U = ranuni(J(n,1,seed));
I = rank(U);
Ystar = Y[I,];

With GAUSS:

n = rows(Y);
U = rnds(n,1,seed);
I = rankindx(U,1);
Ystar = Y[I,.];

Appendix 2

C++, SC, and Stata Code for Permutation Tests

Simple variations of the C++ program provided in Appendix 1 yield many important test statistics.

Estimating the Significance Level of the Main Effect of Fertilizer on Crop Yield in a Balanced Design

Set aside space for:
    Monte       the number of Monte Carlo simulations
    S_0         the value of the test statistic for the original observations
    S           the test statistic for the rearranged data
    data        {5, 10, 8, 15, 22, 18, 21, 29, 25, 6, 9, 12, 25, 32, 40, 55, 60, 48}
    n = 3       the number of observations in each category
    blocks = 2  the number of blocks
    levels = 3  the number of levels of the factor

MAIN program
    GET_DATA: put all the observations into a single linear vector
    COMPUTE S_0 for the original observations
    Repeat Monte times:
        for each block
            REARRANGE the data in the block
        COMPUTE S
        Compare S with S_0
    PRINT out the proportion of times S was larger than S_0

REARRANGE
    Set s to the number of observations in the block
    Start:
        Choose a random integer k from 0 to s-1
        Swap X[k] and X[s-1]
        Decrement s and repeat from Start
    Stop after you've selected all but one of the observations in the block.
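The Fertilizer Example in C++

The pseudocode above leaves the choice of test statistic and the details of the within-block rearrangement to the reader. The sketch below fills in those details under two assumptions that are not spelled out in the original text: the data are laid out block by block, with each block holding the three levels in order and three replicates per level, and the test statistic is the sum of the squared level totals, which in a balanced design is a monotone function of the treatment sum of squares. All names and constants are illustrative.

// Monte Carlo permutation test for the main effect of fertilizer in a blocked design.
#include <algorithm>
#include <iostream>
#include <random>
#include <vector>

const int n = 3, blocks = 2, levels = 3;       // design constants from the example
const int blockSize = n * levels;              // 9 observations per block

// Sum over levels of (total yield at that level)^2, pooled across blocks.
double statistic(const std::vector<double>& x) {
    double s = 0.0;
    for (int l = 0; l < levels; ++l) {
        double total = 0.0;
        for (int b = 0; b < blocks; ++b)
            for (int j = 0; j < n; ++j)
                total += x[b * blockSize + l * n + j];
        s += total * total;
    }
    return s;
}

int main() {
    std::vector<double> X = {5, 10, 8, 15, 22, 18, 21, 29, 25,
                             6, 9, 12, 25, 32, 40, 55, 60, 48};
    const int Monte = 10000;                   // number of Monte Carlo simulations
    std::mt19937 gen(12345);

    double S0 = statistic(X);                  // statistic for the original observations
    int count = 0;
    for (int m = 0; m < Monte; ++m) {
        for (int b = 0; b < blocks; ++b)       // REARRANGE each block separately
            std::shuffle(X.begin() + b * blockSize,
                         X.begin() + (b + 1) * blockSize, gen);
        if (statistic(X) >= S0) ++count;       // as or more extreme than the original
    }
    std::cout << "estimated significance level: "
              << static_cast<double>(count) / Monte << "\n";
    return 0;
}

Because each block is shuffled in place, every pass produces a fresh random rearrangement within each block while leaving the block totals, and hence the block effect, untouched; only the assignment of yields to fertilizer levels changes.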