
Program Your Own Resampling Statistics


APPENDIX 1

Program Your Own Resampling Statistics

Whether you program from scratch in BASIC, C, or FORTRAN or use a macro language like that provided by Stata, S-Plus, or SAS, you'll follow the same four basic steps: 1. Read in the observations. 2. Compute the test statistic for these observations. 3. Resample repeatedly. 4. Compare the test statistic with its resampling distribution.

Begin by reading in the observations and, even if they form an array, store them in the form of a linear vector for ease in resampling. At the same time, read and store the sample sizes so the samples can be re-created from the vector. For example, if we have two samples 1, 2, 3, 4 and 1, 3, 5, 6, 7, we would store the observations in one vector X = (1 2 3 4 1 3 5 6 7) and the sample sizes in another (4 5) so as to be able to mark where the first sample ends and the next begins.

Compute the value of the test statistic for the observations you just read in. If you are deriving a permutation statistic, several shortcuts can help you here. Consider the well-known t-statistic for comparing the means of two samples:

    t = (X-bar - Y-bar)/sqrt(s^2_Xbar + s^2_Ybar)

The denominator is the same for all permutations, so don't waste time computing it; instead use the statistic t' = X-bar - Y-bar. A little algebra shows that

    t' = (sum X_i)(1/n_x + 1/n_y) - (sum X_i + sum Y_i)(1/n_y)

Four Steps to a Resampling Statistic

1. Get the data.
2. Compute the statistic S_0 for the original observations.
3. Loop:
   a. Resample.
   b. Compute the statistic S for the resample.
   c. Determine whether S > S_0.
4. State the p-value.

But the sum (sum X_i + sum Y_i) of all the observations is the same for all permutations, as are the sample sizes n_x and n_y; eliminating these terms (why do extra work if you don't need to?) leaves

sum X_i, the sum of the observations in the first sample, a sum that is readily computed. Whatever you choose as your test statistic, eliminate terms that are the same for all permutations before you begin.

How we resample will depend on whether we use the bootstrap or a permutation test. For the bootstrap, we repeatedly sample with replacement from the original sample. Some examples of program code are given in a sidebar. More extensive examples entailing corrections for bias and acceleration are given in Appendix 3.

IMPORTANCE SAMPLING

We may sample so as to give equal probability to each of the observations, or we may use importance sampling, in which we give greater weight to some of the observations and less weight to others.

Computing a Bootstrap Estimate
C++ Code:
    #include <stdlib.h>
    get_data();    // put the variable of interest in the first n
                   // elements of the array X[]
    randomize();   // initializes random number generator
    for (i = 0; i < 100; i++) {
        for (j = 0; j < n; j++) {
            k = rand() % n;    // choose an element of X[] at random
            Y[j] = X[k];       // copy it into the resample Y[]
        }
        // compute and record the statistic for the resample Y[]
    }

Computing a Bootstrap Estimate
GAUSS:
    n = rows(Y);
    U = rnds(n,1,seed);
    I = trunc(n*U + ones(n,1,1));
    Ystar = Y[I,.];
(This routine only generates a single resample.)

SAS Code:
    PROC IML;
    n = nrow(Y);
    U = ranuni(J(n,1,seed));
    I = int(n*U + J(n,1,1));
    Ystar = Y[I,];
(This routine only generates a single resample. A set of comprehensive algorithms in the form of SAS macros is provided in Appendix 3.)

S-Plus Code:
    "bootstrap" <- function(x, nboot, statistic, ...){
        data <- matrix(sample(x, size=length(x)*nboot, replace=T),
                       nrow=nboot)
        return(apply(data, 1, statistic, ...))
    }
    # where statistic is an S+ function which calculates the
    # statistic of interest.

Stata Code:
    use data, clear
    bstrap median, args(height) reps(100)

where median has been predefined as follows:
    program define median
        if "'1'" == "?" {
            global S_1 "median"
            exit
        }
        summarize '2', detail
        post '1' _result(10)
    end

The idea behind importance sampling is to reduce the variance of the estimate and, thus, the number of resamples necessary to achieve a given level of confidence. Suppose we are testing the hypothesis that the population mean is 0 and our original sample contains the observations -2, -1, -0.5, 0, 2, 3, 3.5, 4, 7, and 8. Resamples containing 7 and 8 are much more likely to have large means, so instead of drawing bootstrap samples such that every observation has the same probability 1/10 of being included, we'll weight the sample so that larger values are more likely to be drawn, selecting -2 with probability 1/55, -1 with probability 2/55, and so forth, with the probability of selecting 8 being 10/55.¹ Let I(S* > 0) = 1 if the mean of the bootstrap sample is greater than 0 and 0 otherwise. The standard estimate of the proportion of bootstrap samples greater than zero is given by

    (1/B) sum_{b=1}^B I(S*_b > 0).

Because we have selected the elements of our bootstrap samples with unequal probability, we have to use a weighted estimate of the proportion

    (1/B) sum_{b=1}^B I(S*_b > 0) prod_{i=1}^{10} (1/(10 pi_i))^{f_ib},

where pi_i is the probability of selecting the ith element of the original sample and f_ib is the number of times the ith element appears in the bth bootstrap sample. Efron and Tibshirani [1993, p. 355] found a sevenfold reduction in variance using importance sampling with samples of size 10. Importance sampling for bootstrap tail probabilities is discussed in Johns [1988], Hinkley and Shi [1989], and Do and Hall [1991]. Mehta, Patel, and Senchaudhuri [1988] considered importance sampling for permutation tests.

SAMPLING WITHOUT REPLACEMENT

For permutation tests, three alternative approaches to resampling suggest themselves:

• Exhaustive — examine all possible resamples.
• Focus on the tails — examine only samples more extreme than the original.
• Monte Carlo.

The first two of these alternatives are highly application-specific. Zimmerman [1985a,b] provides algorithms for exhaustive enumeration of permutations. The computationally efficient algorithms developed in Mehta and Patel [1983] and Mehta, Patel, and Gray [1985] underlie the StatXact results for categorical and ordered data reported in Chapter 7. Algorithms for exact inference are provided in Hirji, Mehta, and Patel [1988]. An accompanying sidebar provides the C++ code for estimating permutation test significance levels via a Monte Carlo. Unlike the bootstrap, permutations

¹ 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 = 55.

Estimating Permutation Test Significance Levels
C++ Code for One-Sided Test:
    int NMonte;        // number of Monte Carlo simulations
    float stat_orig;   // value of test statistic for the unpermuted
                       // observations
    float X[];         // vector containing the data
    int n[];           // vector containing the cell/sample sizes
    int N;             // total number of observations

    int main() {
        int i, s, k, seed, fcnt = 0;
        float temp;
        get_data(X, n);
        stat_orig = statistic(X, n);   // compute statistic for the
                                       // original sample
        srand(seed);

        for (i = 0; i < NMonte; i++) {
            for (s = N; s > n[1]; s--) {
                k = choose(s);     // choose a random integer from 0 to s-1
                temp = X[k];
                X[k] = X[s-1];     // put the one you chose at end of
                                   // vector
                X[s-1] = temp;
            }
            if (statistic(X, n) >= stat_orig)
                fcnt++;
            // count only if the value of the statistic for the
            // rearranged sample is as or more extreme than its value
            // for the original sample
        }
        cout << "alpha = " << (float)fcnt/NMonte;
    }

    void get_data(float X[], int n[]) {
        // This user-written procedure gets all the data and packs
        // it into a single long linear vector X. The vector n is
        // packed with the sample sizes.
    }

    float statistic(float X[], int n[]) {
        // This user-written procedure computes and returns the
        // test statistic
    }

Computing a Single Random Rearrangement
With SAS:
    PROC IML;
    n = nrow(Y);
    U = ranuni(J(n,1,seed));
    I = rank(U);
    Ystar = Y[I,];

With GAUSS:
    n = rows(Y);
    U = rnds(n,1,seed);
    I = rankindx(U,1);
    Ystar = Y[I,.];

require that we select elements without replacement. The trick (see sidebar) is to stick each selected observation at the end of the vector, where it will not be selected a second time. Appendix 2 contains several specific applications of this programming approach.

APPENDIX 2

C++, SC, and Stata Code for Permutation Tests

Simple variations of the C++ program provided in Appendix 1 yield many important test statistics.

Estimating Significance Level of the Main Effect of Fertilizer on Crop Yield in a Balanced Design

Set aside space for
    Monte       the number of Monte Carlo simulations
    S_0         the original value of test statistic
    S           test statistic for rearranged data
    data        {5, 10, 8, 15, 22, 18, 21, 29, 25, 6, 9, 12, 25, 32, 40, 55, 60, 48};
    n = 3       number of observations in each category
    blocks = 2  number of blocks
    levels = 3  number of levels of factor

Main program:
    GET_DATA       put all the observations into a single linear vector
    COMPUTE S_0 for the original observations
    Repeat Monte times:
        for each block
            REARRANGE the data in the block
        COMPUTE S
        Compare S with S_0
    PRINT out the proportion of times S was larger than S_0

REARRANGE

    Set s to the number of observations in the block
    Start: Choose a random integer k from 0 to s-1
        Swap X[k] and X[s-1]
        Decrement s and repeat from Start
    Stop after you've selected all but one of the observations.

GET_DATA
    user-written procedure gets data and packs it into a two-dimensional array in which each row corresponds to a block.

COMPUTE
    F_1 = sum_{i=1}^2 sum_{j=1}^3 |X_{ij.} - X_{i..}|
    for each block
        calculate the mean of that block
        for each level within a block
            calculate the mean of that block-level
            calculate difference from block mean

Estimating Significance Level of the Interaction of Sunlight and Fertilizer on Crop Yield

Test statistic based on the deviates from the additive model. Design must be balanced.

Set aside space for
    Monte       the number of Monte Carlo simulations
    S_0         the original value of test statistic
    S           test statistic for rearranged data
    data        {5, 10, 8, 15, 22, 18, 21, 29, 25, 6, 9, 12, 25, 32, 40, 55, 60, 48};
    deviates    vector of deviates
    n = 3       number of observations in each category
    blocks = 2  number of blocks
    levels = 3  number of levels of factor

Main program:
    GET_DATA
    Calculate the DEVIATES
    COMPUTE the test statistic S_0
    Repeat Monte times:
        REARRANGE the observations
        COMPUTE the test statistic S
        Compare S with S_0
    Print out the proportion of times S was larger than S_0

COMPUTE
    sum_{i=1}^I sum_{j=1}^J (sum_{k=1}^K X'_{ijk})^2
    for each block

        for each level
            sum the deviates
            square this sum
            cumulate

DEVIATES
    X'_{ijk} = X_{ijk} - X_{i..} - X_{.j.} + X_{...}
    Set aside space for level means, block means, and grand mean
    for each level
        calculate mean
    for each block
        calculate mean
    for each level
        cumulate grand mean
    for each block
        for each level
            calculate deviate from additive model

Bounds on Significance Level for Type I Censored Data

Set aside space for
    Monte             the number of Monte Carlo simulations
    T_0 = S_U + C*N_C the original value of the test statistic
    T = S_U + C*N_C   test statistic for rearranged data
    data              (880, 3430, 2860, C, 4750, 3820, C, 7170, C, 4800);
    n = 5, m = 5      number of observations in each category
    C = 10,000

Main program:
    GET_DATA       put all the observations into a single linear vector
    COMPUTE T_0 for the original observations
    Repeat Monte times:
        REARRANGE the data
        COMPUTE S_U, N_C, T
        Compare T with T_0
    PRINT out the proportion of times T was larger than T_0

REARRANGE
    Set s = n
    Start: Choose a random integer k from 0 to s-1
        Swap X[k] and X[s-1]
        Decrement s and repeat from Start
    Stop when s = 0.

GET_DATA
    user-written procedure gets data and packs it into a linear vector

COMPUTE
    N_C = S_U = 0
    For i = 1 to n
        If X[i] = C then N_C++ else S_U += X[i]
    T = S_U + C*N_C

Two-Sample Test for Dispersion Using Aly's Statistic

Written in SC™ (see Appendix 4)

good_aly(X,Y,s)
    X,Y -> data vectors, size nx, ny
    s   -> no. of assignments to sample (default: all)

If nx = ny, m = nx; otherwise let m be the smaller of nx, ny and replace the larger vector by a sample of size m from it. Let x, y be the corresponding order statistics. Let {Dx, Dy} be the m - 1 differences x[i+1] - x[i] and y[i+1] - y[i], i = 1, ..., m - 1, and let w[i] = i(m - i). These {Dx, Dy} are passed to perm2_(), which computes w.pi(Dx), where pi(Dx) is a random rearrangement of m - 1 elements from {Dx, Dy}, and finds the two-tailed fraction of rearrangements for which |w.Dx| >= |w.pi(Dx)|.

func good_aly_f(){
    sort($1$)
    return g_c*$1$
}

proc good_aly(){
    local(ac,el,obs,X,Y,m,mm1,m1)

    # check arguments
    if (ARGCHECK) {
        m1= "(vec,vec) or (vec,vec,expr)\n"
        if(NARGS<2 || NARGS>3) abort_(ui,$0$,m1)
        if(!isvec($1$) || !isvec($2$)) abort_(ui,$0$,m1)
        if (NARGS>=3) {
            if($3$!=nint($3$) || $3$<10) abort_($0$,a30d)
        }
    }

    # set 1-up element-numbering
    el= ELOFFSET; ELOFFSET= -1
    # arrive at, if necessary, equal-size (m) vectors
    X= $1$; Y= $2$
    if(sizeof(X) > sizeof(Y)){
        shuffle(X); limit(X,1,m=sizeof(Y))
    }else m= sizeof(X)
    mm1= m - 1

    # get the pairwise differences
    sort(X); X= X[2:m]-X[1:mm1]
    sort(Y); Y= Y[2:m]-Y[1:mm1]
    # get the multipliers for the differences, as global g_c
    vector(g_c,mm1)
    ords(g_c)
    g_c= g_c*(-g_c+m)
    # Result for the actual data:
    print("\tAly delta=",obs=(good_aly_f(Y) - good_aly_f(X))/(m*m))

    # Call perm2_() to give the left and right tail over all
    # (3 arguments) or some (4 arguments) using the statistic
    # returned by good_aly_f(); order of arguments is reversed
    # in perm2_() calls to make scale(Y) > scale(X) correspond
    # to a small upper tail probability, for consistency with
    # other similar SC routines

# turn off argument-checking for the call to perm2_() ac= ARGCHECK; ARGCHECK= 0

if (NARGS==2) { # complete enumeration perm2_(good_aly_f,Y,X) }else{ # subset only perm2_(good_aly_f,Y,X,$3$) }

# free globals free(g_c)

    # restore re-set globals
    AUX[5]= obs
    ARGCHECK= ac
    ELOFFSET= el
}

Using Stata for Permutation Tests¹

Put the following program in a file called permute.ado. Be sure that the file is saved as a plain text (ASCII) file and that there is a hard return at the end of the last line in the file. Put permute.ado in your C:\ADO directory (you may have to create this directory) or your Stata working directory (i.e., your current directory). Do not put it in the C:\STATA\ADO directory.

program define permute
    version 4.0
    parse "'*'", parse(" ,")
    local prog "'1'"
    capture which 'prog'
    if _rc {
        di in red "program 'prog' not found"
        exit 199
    }
    mac shift

    local varlist "req ex"
    #delimit ;
    local options "BY(str) LEft RIght Reps(int 100)
        DIsplay(int 10) EPS(real 1e-7) noProb POST(str)
        DOuble EVery(str) REPLACE" ;
    #delimit cr
    parse "'*'"
    parse "'varlist'", parse(" ")
    local x "'1'"
    macro shift

    if "'by'"!="" {
        unabbrev 'by'
        local by "$S_1"
        local byby "by 'by':"
    }
    if 'eps' < 0 {
        di in red "eps() must be greater than or equal to zero"
        exit 198
    }
    if "'left'"!="" & "'right'"!="" {
        di in red "only one of left or right can be specified"

¹ These ado files are authored by William M. Sribney, Stata Corporation. (c) 1996 Stata Corporation, 702 University Drive East, College Station, TX 77840, (409) 696-4600, (800) 782-8272, Email: [email protected], Web: http://www.stata.com.

        exit 198
    }
    if "'left'"!="" {
        *local ho "Test of Ho: T = 0 vs. Ha: T < 0 (one-sided)"
        local ho "p is an estimate of Pr(T <= T(obs))"
        local rel "<="
        local eps = -'eps'
    }
    else if "'right'"!="" {
        *local ho "Test of Ho: T = 0 vs. Ha: T > 0 (one-sided)"
        local ho "p is an estimate of Pr(T >= T(obs))"
        local rel ">="
    }
    else {
        *local ho "Test of Ho: T = 0 vs. Ha: T ~= 0 (two-sided)"
        local ho "p is an estimate of Pr(|T| >= |T(obs)|)"
        local rel ">="
        local abs "abs"
    }
    if "'post'"=="" & "'double'"!="" {
        di in red "double can only be specified when using post()"
        exit 198
    }
    if "'post'"=="" & "'every'"!="" {
        di in red "every() can only be specified when using post()"
        exit 198
    }
    if "'post'"=="" & "'replace'"!="" {
        di in red "replace can only be specified when using post()"
        exit 198
    }

/* Get value of test statistic for unpermuted data. */

    preserve
    global S_1 "first"

    if "'prob'"=="" | "'post'"!="" {
        capture noisily 'prog' 'x' '*'
        if _rc {
            di in red _n "'prog' returned error:"
            error _rc

        }
        if "$S_1"=="first" | "$S_1"=="" {
            di in red "'prog' does not set global macro S_1"
            exit 7
        }
        capture confirm number $S_1
        if _rc {
            di in red "'prog' returned '$S_1' where number " /*
                */ "expected"
            exit 7
        }
    }

    /* Initialize postfile. */

    if "'post'"!="" {
        tempname postnam
        if "'every'"!="" {
            confirm integer number 'every'
            local every "every('every')"
        }
        postfile 'postnam' stat using 'post', /*
            */ 'replace' 'double' 'every'
    }
    else local po "*"

    /* Display observed test statistic and Ho. */

    set more 1
    di in gr "(obs=" _N ")"

    if "'prob'"=="" {
        tempname comp
        local tobs "$S_1"
        scalar 'comp' = 'abs'($S_1) - 'eps'
        local dicont "_c"
        local c 0
        di _n in gr "Observed test statistic = T(obs) = " /*
            */ in ye %9.0g 'tobs' _n(2) in gr "'ho'" _n
    }
    else local so "*"

    /* Sort by 'by' if necessary. */

    if "'by'"!="" {
        sort 'by'
    }

/* Check if 'x' is a single dichotomous variable. */

    quietly {
        tempvar k
        summarize 'x'
        capture assert _result(1)==_N /*
            */ & ('x'==_result(5) | 'x'==_result(6))
        if _rc==0 {
            tempname min max
            scalar 'min' = _result(5)
            scalar 'max' = _result(6)

'byby' gen long 'k' = sum('x'=='max') 'byby' replace 'k' = 'k' [_N]

            local oo "*"
        }
        else {
            gen long 'k' = _n
            local do "*"
        }
    }

/* Do permutations. */

    local i 1
    while 'i' <= 'reps' {
        'oo' PermVars "'by'" 'k' 'x'

'do' PermDiV "'by'" 'k' 'min' 'max' 'x'

        'prog' 'x' '*'
        'so' local c = 'c' + ('abs'($S_1) 'rel' 'comp')
        'po' post 'postnam' $S_1

        if 'display' > 0 & mod('i','display')==0 {
            di in gr "n = " in ye %5.0f 'i' 'dicont'
            'so' noi di in gr "  p = " /*
                */ in ye %4.0f 'c' "/" %5.0f 'i' /*
                */ in gr " = " in ye %7.5f 'c'/'i' /*
                */ in gr "  s.e.(p) = " in ye %7.5f /*
                */ sqrt(('i'-'c')*'c'/'i')/'i'
        }
        local i = 'i' + 1
    }

    if "'prob'"=="" {
        if 'display' > 0 {
            di
        }
        di in gr "n = " in ye %5.0f 'reps' /*
            */ in gr "  p = " in ye %4.0f 'c' "/" %5.0f 'reps' /*
            */ in gr " = " in ye %7.5f 'c'/'reps' /*
            */ in gr "  s.e.(p) = " in ye %7.5f /*
            */ sqrt(('reps'-'c')*'c'/'reps')/'reps'
    }

if "'post'" !="" { post close 'postnam' }

/* Save global macros. */

    global S_1 "'reps'"
    global S_2 "'c'"
    global S_3 "'tobs'"
end

program define PermVars /* "byvars" k var */
    version 4.0
    local by "'1'"
    local k "'2'"
    local x "'3'"
    tempvar r y
    quietly {
        if "'by'"!="" {
            by 'by': gen double 'r' = uniform()
        }
        else gen double 'r' = uniform()

        sort 'by' 'r'
        local type : type 'x'
        gen 'type' 'y' = 'x'['k']
        drop 'x'
        rename 'y' 'x'
    }
end

program define PermDiV /* "byvars" k min max var */
    version 4.0
    local by "'1'"
    local k "'2'"
    local min "'3'"
    local max "'4'"
    local x "'5'"
    tempvar y

    if "'by'"!="" {
        sort 'by'
        local byby "by 'by':"
    }
    quietly {
        gen byte 'y' = . in 1
        'byby' replace 'y' = uniform() < ('k'-sum('y'[_n-1]))/(_N-_n+1)
        replace 'x' = cond('y','max','min')
    }
end

progname is the name of a program that computes the test statistic and places its value in the global macro S_1. The arguments to progname are varname1 [varlist]. For each repetition, the values of varname1 are randomly permuted, progname is called to compute the test statistic, and a count is kept of whether this value of the test statistic is more extreme than the observed test statistic. The values of the test statistic for each random permutation can also be stored in a data set using the post() option.

Options

by(groupvars) specifies that the permutations be performed within each group defined by the values of groupvars; i.e., group membership is fixed and the values of varname1 are independently permuted within each group. For example, this permutation scheme is used for randomized-block ANOVA to permute values within each block.

reps(#) specifies the number of random permutations to perform. Default is 100.

display(#) displays output every #-th random permutation. Default is 10. display(0) suppresses all but the final output.

left | right request that one-sided p-values be computed. If left is specified, an estimate of Pr(T <= T(obs)) is produced, where T is the test statistic and T(obs) is its observed value. If right is specified, an estimate of Pr(T >= T(obs)) is produced. By default, two-sided p-values are computed; i.e., Pr(|T| >= |T(obs)|) is estimated.

noprob specifies that no p-values are to be computed.

eps(#) specifies the numerical tolerance for testing |T| >= |T(obs)|, T <= T(obs), or T >= T(obs). These are considered true if, respectively, |T| >= |T(obs)| - #, T <= T(obs) + #, or T >= T(obs) - #. By default, it is 1e-7. eps() should not have to be set under normal circumstances.

post(filename) specifies a name of a .dta file that will be created holding the values of the test statistic computed for each random permutation.

double can only be specified when using post(). It specifies that the values of the test statistic be stored as type double; default is type float.

every(#) can only be specified when using post(). It specifies that the values of the test statistic be saved to disk every #-th repetition.

replace indicates that the file specified by post() may already exist and, if it does, it can be erased and replaced by a new one.

Note: permute works faster when varname1 is a 0/1 variable (with no missing values). So, if using a 0/1 variable, specify it as the one to be permuted.

progname must have the following outline:

program define progname
    compute test statistic
    global S_1 = test statistic
end

Arguments to progname are varname1 [varlist]; i.e., the same variables that were specified with permute are passed to progname. Here is an example of a program that estimates the permutation distribution p-value for the Pearson correlation coefficient:

program define permpear
    quietly corr '1' '2'
    global S_1 = _result(4)
end

To use this program, call permute using

    permute permpear x y

In addition, the global macro S_1 is set to "first" for the first call to progname, which computes the observed test statistic T(obs); i.e., T(obs) is the value of the test statistic for the unpermuted data.

Thus, progname can optionally have the form:

program define progname /* args = varname1 [varlist] */
    if "$S_1" == "first" {
        do initial computations
    }
    compute test statistic
    global S_1 = test statistic
end

Here is an example of a program that estimates the permutation p-value for the two-sample t test:

program define permt2
    local grp "'1'"
    local x "'2'"
    tempvar sum
    quietly {

        if "$S_1"=="first" {
            gen double 'sum' = sum('x')
            scalar _TOTAL = 'sum'[_N]
            drop 'sum'
            summarize 'grp'
            scalar _GROUP1 = _result(5)
            count if 'grp'==_GROUP1
            scalar _TOTAL = (_result(1)/_N)*_TOTAL
        }
        gen double 'sum' = sum(('grp'==_GROUP1)*'x')
        global S_1 = 'sum'[_N] - _TOTAL
    }
end

To use this program, call permute using

    permute permt2 group x

Examples include

    permute permpear x y
    permute permpear x y, reps(1000)
    permute permpear x y, reps(10000) display(100)
    permute permpear x y, reps(1000) di(100) post(pearson)
    permute permpear x y, reps(10000) di(1000) post(pearson) replace every(1000) double
    permute permt2 group x
    permute permt2 group x, left
    permute panova treat outcome subject, by(subject) reps(1000)

program define permpear
    version 4.0
    qui corr '1' '2'
    global S_1 = _result(4)
end

program define permt2
    version 4.0
    local grp "'1'"
    local x "'2'"
    tempvar sum
    quietly {
        if "$S_1"=="first" {
            gen double 'sum' = sum('x')
            scalar _TOTAL = 'sum'[_N]
            drop 'sum'
            summarize 'grp'

            scalar _GROUP1 = _result(5)
            count if 'grp'==_GROUP1
            scalar _TOTAL = (_result(1)/_N)*_TOTAL
        }
        gen double 'sum' = sum(('grp'==_GROUP1)*'x')
        global S_1 = 'sum'[_N] - _TOTAL
    }
end

program define panova
    version 4.0
    local treat "'1'"
    local x "'2'"
    local subj "'3'"
    quietly {
        anova 'x' 'subj' 'treat'
        test 'treat'
        global S_1 = _result(6)
    }
end

To Learn More

Consult Good [1994].

APPENDIX 3

SAS and S-PLUS Code for Bootstraps

S-PLUS Code

S-PLUS code for obtaining BCa intervals may be obtained from the statistics archive of Carnegie-Mellon University by sending an email to [email protected] with the one-line message send bootstrap.funs from S. Hints on usage will be found in Efron and Tibshirani [1993].

Stata Code

Bootstrap Samples of Correlation

program define bcorr
    tempname sim
    postfile 'sim' rho using results, replace
    quietly {
        local i = 1
        while 'i' <= 200 {
            use mardia.dta, clear
            bsample _N
            correlate alg stat
            post 'sim' _result(4)
            local i = 'i' + 1
        }
    }
    postclose 'sim'
end

bcorr
use results.dta, clear
summarize, detail

SAS General-Purpose Macros

SAS macros for bootstrapping may be found in SAS System for Graphics [1991] (pp. 116-117, 556-558) and are the work of , York University, Ontario, Canada.

%macro boot(        /* Bootstrap resampling analysis */
data=,              /* Input data set, not a view or a tape file. */
samples=200,        /* Number of resamples to generate. */
residual=,          /* Name of variable in the input data set that
                       contains residuals; may not be used with SIZE= */
equation=,          /* Equation (in the form of an assignment
                       statement) for computing the response variable */
size=,              /* Size of each resample; default is size of the
                       input data set. The SIZE= argument may not be
                       used with BALANCED=1 or with a nonblank value
                       for RESIDUAL= */
balanced=,          /* 1 for balanced resampling; 0 for uniform
                       resampling. By default, balanced resampling is
                       used unless the SIZE= argument is specified, in
                       which case uniform resampling is used. */
random=0,           /* Seed for pseudorandom numbers. */
stat=_numeric_,     /* Numeric variables in the OUT= data set created
                       by the %ANALYZE macro that contain the values
                       of statistics for which you want to compute
                       bootstrap distributions. */
id=,                /* One or more numeric or character variables that
                       uniquely identify the observations of the OUT=
                       data set within each BY group. No ID variables
                       are needed if the OUT= data set has only one
                       observation per BY group. The ID variables may
                       not be named _TYPE_, _NAME_, or _STAT_ */
biascorr=1,         /* 1 for bias correction; 0 otherwise */
alpha=.05,          /* significance (i.e., one minus confidence) level
                       for confidence intervals; blank to suppress
                       normal confidence intervals */
print=1,            /* 1 to print the bootstrap estimates;
                       0 otherwise. */
chart=1             /* 1 to chart the bootstrap resampling
                       distributions; 0 otherwise. */
);

%let useby=0;
%global usevardf vardef;
%let usevardf=0;

*** compute the actual values of the statistics;
%let vardef=DF;
%let by=;
%analyze(data=&data,out=_ACTUAL_);
%if &syserr>4 %then %goto exit;

*** compute plug-in estimates;
%if &usevardf %then %do;
    %let vardef=N;
    %analyze(data=&data,out=_PLUGIN_);
    %let vardef=DF;
    %if &syserr>4 %then %goto exit;
%end;

%if &useby=0 %then %let balanced=0;

%if %bquote(&size)^= %then %do;
    %if %bquote(&balanced)= %then %let balanced=0;
    %else %if &balanced %then %do;
        %put %cmpres(ERROR in BOOT: The SIZE= argument may not be
            used with BALANCED=1.);
        %goto exit;
    %end;
    %if %bquote(&residual)^= %then %do;
        %put %cmpres(ERROR in BOOT: The SIZE= argument may not be
            used with RESIDUAL=.);
        %goto exit;
    %end;
%end;
%else %if %bquote(&balanced)= %then %let balanced=1;

*** find number of observations in the input data set;
%global _nobs;

data _null_;
    call symput('_nobs',trim(left(put(_nobs,12.))));
    if 0 then set &data nobs=_nobs;
    stop;
run;
%if &syserr>4 %then %goto exit;

%if &balanced %then
    %bootbal(data=&data,samples=&samples,random=&random,print=0);
%else %if &useby %then
    %bootby(data=&data,samples=&samples,random=&random,
        size=&size,print=0);

%if &syserr>4 %then %goto exit;

%if &balanced | &useby %then %do;
    %let by=_sample_;
    %analyze(data=BOOTDATA,out=BOOTDIST);
%end;

%else %bootslow(data=&data, samples=&samples , random=&random,size=&size);

%if &syserr>4 %then %goto exit;

%if &chart %then %do;
    %if %bquote(&id)^= %then %do;
        proc sort data=BOOTDIST; by &id; run;
        proc chart data=BOOTDIST(drop=_sample_);
            vbar &stat; by &id;
        run;
    %end;
    %else %do;
        proc chart data=BOOTDIST(drop=_sample_);
            vbar &stat;
        run;
    %end;
%end;

%bootse(stat=&stat,id=&id,alpha=&alpha,biascorr=&biascorr, print=&print)

%exit: ;

%mend boot;

%macro bootbal(     /* Balanced bootstrap resampling */
data=&_bootdat,
samples=200,
random=0,
print=0
);

* Gleason, J.R. (1988);

data BOOTDATA/view=BOOTDATA;
    %bootin;
    drop _a _cbig _i _ii _j _jbig _k _l _s;
    array _c(&_nobs) _temporary_;   /* cell counts */
    array _p(&_nobs) _temporary_;   /* pointers */
    do _j=1 to &_nobs;
        _c(_j)=&samples;
    end;
    do _j=1 to &_nobs;
        _p(_j)=_j;
    end;
    _k=&_nobs;        /* number of nonempty cells left */
    _jbig=_k;         /* index of largest cell */
    _cbig=&samples;   /* _cbig >= _c(_j) */

    do _sample_=1 to &samples;
        do _i=1 to &_nobs;
            do until(_s<=_c(_j));
                _j=ceil(ranuni(&random)*_k);      /* choose a cell */
                _s=ceil(ranuni(&random)*_cbig);   /* accept cell? */
            end;
            _l=_p(_j);
            _obs_=_l;
            _c(_j)+-1;
            * put _sample_= _i= _k= _l= @30
                %do i=1 %to &_nobs; _c(&i) %end; ;
            if _j=_jbig then do;
                _a=floor((&samples-_sample_-_k)/_k);
                if _cbig-_c(_j)>_a then do;
                    do _ii=1 to _k;
                        if _c(_ii)>_c(_jbig) then _jbig=_ii;

                    end;
                    _cbig=_c(_jbig);
                end;
            end;

            if _c(_j)=0 then do;
                if _jbig=_k then _jbig=_j;
                _p(_j)=_p(_k);
                _c(_j)=_c(_k);
                _k+-1;
            end;
            %bootout(_l);
        end;
    end;
    stop;
run;
%if &syserr>4 %then %goto exit;

%if &print %then %do;
    proc print data=BOOTDATA;
        id _sample_ _obs_;
    run;
%end;

%exit: ;

%mend bootbal;

%macro bootby(      /* Uniform bootstrap resampling */
data=&_bootdat,
samples=200,
random=0,
size=,
print=0
);

%if %bquote(&size)= %then %let size=&_nobs;

data BOOTDATA/view=BOOTDATA;
    %bootin;
    do _sample_=1 to &samples;
        do _i=1 to &size;
            _p=ceil(ranuni(&random)*&_nobs);
            _obs_=_p;
            %bootout(_p);
        end;
    end;
    stop;
run;
%if &syserr>4 %then %goto exit;

%if &print %then %do;
    proc print data=BOOTDATA;
        id _sample_ _obs_;
    run;

%end;

%exit: ;
%mend bootby;

%macro bootslow(    /* Uniform bootstrap resampling and analysis
                       without BY processing */
data=&_bootdat,
samples=20,
random=0,
size=
);

%put %cmpres(WARNING: Bootstrap analysis will be slow because the
    %str(%%)ANALYZE macro did not use the BYSTMT macro.);

%if %bquote(&size)= %then %let size=&_nobs;

data BOOTDIST; set _ACTUAL_; _sample_=0; delete; run;

options nonotes;
%local sample;
%do sample=1 %to &samples;
    %put Bootstrap sample &sample;
    data _TMPD_;
        %bootin;
        do _i=1 to &size;
            _p=ceil(ranuni(%eval(&random+&sample))*&_nobs);
            %bootout(_p);
        end;
        stop;
    run;
    %if &syserr>4 %then %goto exit;

    %analyze(data=_TMPD_,out=_TMPS_);
    %if &syserr>4 %then %goto exit;
    data _TMPS_; set _TMPS_; _sample_=&sample; run;
    %if &syserr>4 %then %goto exit;
    proc append data=_TMPS_ base=BOOTDIST; run;
    %if &syserr>4 %then %goto exit;
%end;

%exit: ;
options notes;
%mend bootslow;

Confidence Intervals for the Correlation Coefficient

SAS Code to Obtain the Bootstrap Distribution

The following %ANALYZE macro can be used to process all the statistics in the OUT= data set from PROC CORR; it extracts only the relevant observations and variables and provides descriptive names and labels.

%macro analyze(data=,out=);
    proc corr noprint data=&data out=&out;
        var lsat gpa;
        %bystmt;
    run;
    %if &syserr=0 %then %do;
        data &out;
            set &out;
            where _type_='CORR' & _name_='LSAT';
            corr=gpa;
            label corr='Correlation';
            keep corr &by;
        run;
    %end;
%mend;

title2 'Bootstrap Analysis';
%boot(data=law,random=123)

%macro bootci(      /* Bootstrap confidence intervals. Creates
                       output data set BOOTCI. */
method,             /* One of the following methods must be specified:
                       PERCENTILE or PCTL
                       HYBRID
                       T
                       BC
                       BCA    Requires the %JACK macro */
stat=,              /* Numeric variables in the OUT= data set created
                       by the %ANALYZE macro that contain the values
                       of statistics for which you want to compute
                       bootstrap distributions. */
student=,           /* For the T method only, numeric variables in the
                       OUT= data set created by the %ANALYZE macro
                       that contain the standard errors of the
                       statistics for

which you want to compute bootstrap distributions. There must be a one-to-one between the VAR= variables and the STUDENT= variables */ id=, /* One or more numeric or character variables that uniquely identify the observations of the OUT= data set within each BY group. No ID variables are needed if the OUT= data set has only one observation per BY group. The ID variables may not be named _TYPE_, _NAME_, or _STAT_*/ alpha=.05, /* significance (i.e., one minus confidence) level for confidence intervals */ print=!) ; /* 1 to print the bootstrap confidence intervals; 0 otherwise. */

%global _bootdat;
%if %bquote(&_bootdat)= %then %do;
   %put ERROR in BOOTCI: You must run BOOT before BOOTCI;
   %goto exit;
%end;

*** check method;
data _null_;
   length method $10;
   method=upcase(symget('method'));
   if method=' ' then do;
      put 'ERROR in BOOTCI: You must specify one of the methods '
          'PCTL, HYBRID, T, BC or BCa';
      abort;
   end;
   else if method='PERCENTILE' then method='PCTL';
   else if method not in ('PCTL' 'HYBRID' 'BC' 'BCA' 'T') then do;
      put "ERROR in BOOTCI: Unrecognized method '" method "'";
      abort;
   end;
   call symput('qmethod',method);
run;
%if &syserr>4 %then %goto exit;

%if &qmethod=T %then %do;
   %if %bquote(&stat)= | %bquote(&student)= %then %do;
      data _null_;
         put 'ERROR: VAR= and STUDENT= must be specified with the T method';
      run;
      %goto exit;
   %end;
%end;

*** sort resampling distributions;
%if %bquote(&id)^= %then %do;
   proc sort data=BOOTDIST;
      by &id _sample_;
   run;
   %if &syserr>4 %then %goto exit;
%end;

*** transpose resampling distributions;
proc transpose data=BOOTDIST prefix=col
               out=BOOTTRAN(rename=(col1=value _name_=name));
   %if %bquote(&stat)^= %then %do;
      var &stat;
   %end;
   by %if %bquote(&id)^= %then &id;
      _sample_;
run;
%if &syserr>4 %then %goto exit;

%if &qmethod=T %then %do;

   *** transpose studentizing statistics;
   proc transpose data=BOOTDIST prefix=col
                  out=BOOTSTUD(rename=(col1=student _name_=studname));
      var &student;
      by %if %bquote(&id)^= %then &id;
         _sample_;
   run;
   %if &syserr>4 %then %goto exit;

   data BOOTTRAN;
      merge BOOTTRAN BOOTSTUD;
      label student='Value of Studentizing Statistic'
            studname='Name of Studentizing Statistic';
   run;
   %if &syserr>4 %then %goto exit;

%end;

proc sort data=BOOTTRAN;
   by %if %bquote(&id)^= %then &id;
      name
      %if &qmethod=BC | &qmethod=BCA %then value;
      %else %if &qmethod=T %then _sample_;
      ;
run;
%if &syserr>4 %then %goto exit;

%if &qmethod=T %then %do;

   *** transpose the actual statistics in each observation;
   *** must get data set in unsorted order for merge;
   proc transpose data=_ACTUAL_ out=_ACTTR_ prefix=value;
      %if %bquote(&stat)^= %then %do;
         var &stat;
      %end;
      %if %bquote(&id)^= %then %do;
         by &id;
      %end;
   run;
   %if &syserr>4 %then %goto exit;

   *** transpose the actual studentizing statistics;
   proc transpose data=_ACTUAL_ prefix=col
                  out=_ACTSTUD(rename=(_name_=studname col1=student));
      var &student;
      %if %bquote(&id)^= %then %do;
         by &id;
      %end;
   run;
   %if &syserr>4 %then %goto exit;

   *** merge statistics with studentizing statistics;
   data _ACT_T_;
      merge _ACTTR_ _ACTSTUD;
      label student='Value of Studentizing Statistic'
            studname='Name of Studentizing Statistic';
   run;
   %if &syserr>4 %then %goto exit;

   proc sort data=_ACT_T_;
      by %if %bquote(&id)^= %then &id;
         _name_;
   run;
   %if &syserr>4 %then %goto exit;

   data BOOTTRAN;
      merge BOOTTRAN _ACT_T_(rename=(_name_=name));
      by %if %bquote(&id)^= %then &id;
         name;
      value=(value-value1)/student;
   run;
   %if &syserr>4 %then %goto exit;

%end;

%if &qmethod=BC | &qmethod=BCA %then %do;

   %if &qmethod=BCA %then %do;

      %global _jackdat;
      %if %bquote(&_jackdat)^=%bquote(&_bootdat) %then %do;
         %jack(data=&_bootdat,stat=&stat,id=&id,alpha=&alpha,
               chart=0,print=&print);
         %if &syserr>4 %then %goto exit;
      %end;

      *** estimate acceleration for BCa;
      proc means data=JACKDIST noprint vardef=df;
         %if %bquote(&stat)^= %then %do;
            var &stat;
         %end;
         output out=JACKSKEW(drop=_type_ _freq_ _sample_) skewness=;
         %if %bquote(&id)^= %then %do;
            by &id;
         %end;
      run;
      %if &syserr>4 %then %goto exit;

      *** transpose skewness;
      proc transpose data=JACKSKEW prefix=col
                     out=_ACCEL_(rename=(col1=skewness _name_=name));
         %if %bquote(&stat)^= %then %do;
            var &stat;
         %end;
         %if %bquote(&id)^= %then %do;
            by &id;
         %end;
      run;
      %if &syserr>4 %then %goto exit;

      proc sort data=_ACCEL_;
         by %if %bquote(&id)^= %then &id;
            name;
      run;
      %if &syserr>4 %then %goto exit;

   %end;
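The acceleration constant that this skewness manipulation encodes can also be computed directly from the jackknife values. The following Python sketch (the name `jackknife_accel` is ours, not SAS's) uses the familiar direct formula a = sum(d_i^3) / (6 * (sum(d_i^2))^(3/2)), where d_i is the deviation of each leave-one-out estimate from the mean of those estimates; the macro arrives at the same quantity by rescaling the sample skewness of the jackknife distribution.

```python
import statistics

def jackknife_accel(values, stat=statistics.fmean):
    # Leave-one-out estimates of the statistic
    n = len(values)
    theta = [stat(values[:i] + values[i + 1:]) for i in range(n)]
    mean_t = sum(theta) / n
    d = [mean_t - t for t in theta]
    num = sum(di ** 3 for di in d)
    den = 6 * sum(di ** 2 for di in d) ** 1.5
    return num / den if den else 0.0

a = jackknife_accel([1.0, 2.0, 3.0, 4.0, 10.0])  # right-skewed toy sample
```

For a symmetric sample the deviations cancel and the acceleration is near zero; the skewed toy sample above gives a small positive value.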

   *** estimate median bias for BC;
   data _BC_;
      retain _alpha _conf;
      drop value value1;
      if _n_=1 then do;
         _alpha=&alpha;
         _conf=100*(1-_alpha);
         call symput('conf',trim(left(put(_conf,best8.))));
      end;

      merge _ACTTR_(rename=(_name_=name)) BOOTTRAN;
      by %if %bquote(&id)^= %then &id;
         name;
      if first.name then do;
         n=0;
         _z0=0;
      end;
      n+1;
      _z0+(value<value1);
      if last.name then do;
         _z0=probit(_z0/n);
         output;
      end;
   run;
   %if &syserr>4 %then %goto exit;
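This data step is doing nothing more exotic than the following sketch: count the fraction of the bootstrap distribution that falls below the observed statistic and convert it to a normal quantile. The function name `z0` is illustrative only, and it assumes the fraction is strictly between 0 and 1 (inv_cdf is undefined at 0 and 1).

```python
from statistics import NormalDist

def z0(boot_values, observed):
    # Median-bias correction: probit of the proportion of bootstrap
    # values falling below the observed statistic.
    below = sum(1 for v in boot_values if v < observed)
    return NormalDist().inv_cdf(below / len(boot_values))

# Half the bootstrap values below the observed value -> no bias:
bias = z0(list(range(1, 11)), 5.5)
```

A positive `z0` signals that the bootstrap distribution sits mostly below the observed value, so both percentile points get shifted upward.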

   *** compute percentile points;
   data BOOTPCTL;
      retain _i _lo _up _nplo _jlo _glo _npup _jup _gup alcl aucl;
      drop _alpha _sample_ _conf _i _nplo _jlo _glo _npup _jup _gup value;
      merge BOOTTRAN _BC_ %if &qmethod=BCA %then _ACCEL_;;
      by %if %bquote(&id)^= %then &id;
         name;
      label _lo='Lower Percentile Point'
            _up='Upper Percentile Point'
            _z0='Bias Correction (Z0)';
      if first.name then do;
         %if &qmethod=BC %then %do;
            _lo=probnorm(_z0+(_z0+probit(_alpha/2)));
            _up=probnorm(_z0+(_z0+probit(1-_alpha/2)));
         %end;
         %else %if &qmethod=BCA %then %do;
            drop skewness;
            retain _accel;
            label _accel='Acceleration';
            _accel=skewness/(-6*sqrt(&_nobs))*
                   (&_nobs-2)/&_nobs/sqrt((&_nobs-1)/&_nobs);
            _i=_z0+probit(_alpha/2);
            _lo=probnorm(_z0+_i/(1-_i*_accel));
            _i=_z0+probit(1-_alpha/2);
            _up=probnorm(_z0+_i/(1-_i*_accel));
         %end;
         _nplo=min(n-.5,max(.5,fuzz(n*_lo)));
         _jlo=floor(_nplo); _glo=_nplo-_jlo;
         _npup=min(n-.5,max(.5,fuzz(n*_up)));
         _jup=floor(_npup); _gup=_npup-_jup;
         _i=0;
      end;
      _i+1;
      if _glo then do;
         if _i=_jlo+1 then alcl=value;
      end;
      else do;
         if _i=_jlo then alcl=value;
         else if _i=_jlo+1 then alcl=(alcl+value)/2;
      end;
      if _gup then do;
         if _i=_jup+1 then aucl=value;
      end;
      else do;
         if _i=_jup then aucl=value;
         else if _i=_jup+1 then aucl=(aucl+value)/2;
      end;
      if last.name then do;
         output;
      end;
   run;
   %if &syserr>4 %then %goto exit;

%end;
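Stripped of the data-step bookkeeping, the adjusted percentile levels computed above reduce to a few lines. This is a hedged Python sketch with names of our own choosing; setting the acceleration to zero recovers the plain BC formula probnorm(z0 + (z0 + probit(p))).

```python
from statistics import NormalDist

def adjusted_levels(z0, accel, alpha=0.05):
    # BCa percentile levels; accel=0 reduces to the BC adjustment.
    nd = NormalDist()
    def adj(p):
        w = z0 + nd.inv_cdf(p)
        return nd.cdf(z0 + w / (1 - accel * w))
    return adj(alpha / 2), adj(1 - alpha / 2)

lo, up = adjusted_levels(z0=0.0, accel=0.0)
# with no bias and no acceleration these are just alpha/2 and 1-alpha/2
```

The adjusted levels `lo` and `up` are then fed into an ordinary percentile lookup on the sorted bootstrap distribution, which is what the _jlo/_glo arithmetic above performs.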

%else %do;

   %local conf pctlpts pctlpre pctlname;
   %let pctlpre=a;
   %let pctlname=lcl ucl;

   data _null_;
      _alpha=&alpha;
      _conf=100*(1-_alpha);
      call symput('conf',trim(left(put(_conf,best8.))));
      %if &qmethod=PCTL %then %do;
         _lo=_alpha/2;
         _up=1-_lo;
      %end;
      %else %if &qmethod=HYBRID | &qmethod=T %then %do;
         _up=_alpha/2;
         _lo=1-_up;
      %end;
      _lo=100*_lo;
      _up=100*_up;
      call symput('pctlpts',trim(left(put(_lo,best8.)))||' '||
                            trim(left(put(_up,best8.))));
   run;
   %if &syserr>4 %then %goto exit;

   proc univariate data=BOOTTRAN noprint pctldef=5;
      var value;
      output out=BOOTPCTL n=n
             pctlpts=&pctlpts pctlpre=&pctlpre pctlname=&pctlname;
      by %if %bquote(&id)^= %then &id;
         name;
   run;
   %if &syserr>4 %then %goto exit;

%end;
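PCTLDEF=5 is the "empirical distribution function with averaging" percentile definition, the same rule the BC/BCa block implements by hand with its _j and _g variables. A sketch, with `pctl5` as an illustrative name of ours:

```python
import math

def pctl5(sorted_vals, p):
    # With n*p = j + g: take the (j+1)-th order statistic when g > 0,
    # else average the j-th and (j+1)-th (1-based order statistics).
    n = len(sorted_vals)
    np_ = n * p
    j = math.floor(np_)
    g = np_ - j
    if g > 0:
        return sorted_vals[min(j, n - 1)]
    return (sorted_vals[max(j - 1, 0)] + sorted_vals[min(j, n - 1)]) / 2

vals = list(range(1, 11))      # 1..10, already sorted
q25 = pctl5(vals, 0.25)        # n*p = 2.5 -> 3rd order statistic
q50 = pctl5(vals, 0.50)        # n*p = 5   -> average of 5th and 6th
```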

data BOOTCI;
   retain &id name value alcl aucl confid method n;
   merge
      %if &qmethod=T %then _ACT_T_(rename=(_name_=name value1=value));
      %else _ACTTR_(rename=(_name_=name value1=value));
      BOOTPCTL;
   by %if %bquote(&id)^= %then &id;
      name;
   %if &qmethod=HYBRID %then %do;
      aucl=2*value-aucl;
      alcl=2*value-alcl;
   %end;
   %else %if &qmethod=T %then %do;
      aucl=value-aucl*student;
      alcl=value-alcl*student;
   %end;
   confid=&conf;
   length method $20;
   method='Bootstrap '||symget('method');
   label name  ='Name'
         value ='Observed Statistic'
         alcl  ='Approximate Lower Confidence Limit'
         aucl  ='Approximate Upper Confidence Limit'
         confid='Confidence Level (%)'
         method='Method for Confidence Interval'
         n     ='Number of Resamples'
         ;
run;
%if &syserr>4 %then %goto exit;
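The HYBRID and T assembly steps reduce to two one-liners, sketched here with hypothetical helper names. This also shows why the upper and lower percentile levels were swapped earlier for these two methods: each formula subtracts the upper percentile to obtain the lower limit.

```python
def hybrid_ci(observed, q_lo, q_up):
    # Hybrid (basic) interval: reflect the bootstrap percentiles
    # about the observed statistic.
    return 2 * observed - q_up, 2 * observed - q_lo

def t_ci(observed, se, t_lo, t_up):
    # Bootstrap-t: t_lo, t_up are percentiles of the studentized
    # resampling distribution (theta* - theta) / se*.
    return observed - t_up * se, observed - t_lo * se

h = hybrid_ci(10.0, 8.0, 13.0)
t = t_ci(10.0, 2.0, -1.5, 2.5)
```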

%if &print %then %do;
   proc print data=BOOTCI label;
      id %if %bquote(&id)^= %then &id;
         name;
   run;
%end;

%exit:;
%mend bootci;

APPENDIX 4

Resampling Software

DBMS Copy, though not a statistics package, is a must for anyone contemplating the use of resampling methods. DBMS Copy lets you exchange files among a dozen or more statistics packages, a half dozen database managers, and several spreadsheets. This book could not have been put together without its aid. DOS and Windows versions. Conceptual Software, 9660 Hillcroft #510, Houston TX 77096; 713/721-4200.

StatXact is for anyone who routinely analyzes categorical or ordered data. The StatXact manual is a textbook in its own right. Can perform any of the techniques outlined in Chapter 6. Versions for MS-DOS, OS/2, Windows, or UNIX. Also available as an add-on module for both SAS and SPSS. LogXact, a second product, supports exact inference for logistic regression and provides conditional logistic regression for matched case-control studies or clustered binomial data. Cytel Software Corporation, 675 Massachusetts Avenue, Cambridge, MA 02139; 617/661-2011; e-mail: [email protected].

Blossom Statistical Analysis Package is an interactive program for analyzing data using multiresponse permutation procedures (MRPP); includes statistical procedures for agreement of model predictions, circular distributions, least absolute deviation, and quantile regression. Programmed by Brian Cade at the U.S. Geological Survey, Midcontinent Ecological Science Center. PC only, with online manual in HTML Frames format. Freeware; may be downloaded from http://www.mesc.usgs.gov/blossom/blossom.html.

CART provides all the flexibility of the CART methodology for discrimination, including a priori probabilities and relative costs. DOS and Windows. Salford Systems, 5952 Bernadette Lane, San Diego, CA 92120.

NPSTAT and NPFACT support 2- and 3-factor designs, repeated measures, and correlations. Contact Richard May, Department of Psychology, University of Victoria, Victoria BC, Canada V8W 3P5.

nQuery Advisor helps you determine sample size for 50+ design and analysis combinations. A challenge to use, but the advice is excellent. Windows. Statistical Solutions, 60 State Street, Suite 700, Boston, MA 02109; 800/262-1171; email: [email protected].

RT is based on the programs in Brian Manly's book Randomization and Monte Carlo Methods in Biology. For DOS only. Analysis of variance with one to three factors, with randomization of observations or residuals; simple and multiple linear regression, with randomization of observations or residuals; Mantel's test with up to 9 matrices of dependent variables; Mead's test and Monte Carlo tests for spatial patterns; tests for trend and periodicity in time series. Particularly valuable for block matrix classification. Western EcoSystem Technology Inc., 1402 S. Greeley Highway, Cheyenne, WY 82007; 307/634-1756.

SAS (Statistical Analysis System) now offers StatXact (less the manual) as an add-on. The statistical literature is filled with SAS macros for deriving bootstrap and density estimates, as in Appendix 3. Offers a wider choice of ready-to-go graphics options than any other package. DOS, Windows, MVS, CMS, VMS, UNIX. SAS Institute Inc., SAS Campus Drive, Cary, NC 27513.

SCRT (Single Case Randomization Tests). Alternating treatments designs, AB designs and extensions (ABA, ABAB, etc.), and multiple baseline designs. Customized statistics may be built and selected as the criteria for the randomization test. Provides a nonparametric meta-analytic procedure. DOS only. Interuniversity Expertise Center ProGAMMA, P.O. Box 841, 9700 AV Groningen, The Netherlands.

Simstat helps you derive the empirical sampling distribution of 27 different estimators. (DOS). Shareware. Normand Pelandeau, 5000 Adam Street, Montreal QC, Canada H1V 1W5.

S-PLUS is a language, a programming environment, and a comprehensive package of subroutines for statistical and mathematical analysis.
Once only for do-it-yourselfers, a dozen texts and hundreds of file repositories on the Internet now offer all the subroutines you'll need for the bootstrap, density estimation, and graphics. For UNIX and Windows. Statistical Sciences, Division of MathSoft, Inc., 1700 Westlake Avenue North, Suite 500, Seattle, WA 98109; 800/569-0123; www: http://statsci.com; email: [email protected].

SPSS offers ease of use, extensive graphics, and add-on modules for exact tests and neural nets. A bootstrap subcommand provides bootstrap estimates of the standard errors and confidence limits for constrained nonlinear regression. (Windows). SPSS Inc., 444 North Michigan Avenue, Chicago, IL 60611; 312/329-2400.

Stata provides a complete set of graphic routines plus subroutines and preprogrammed macros for bootstrap, density estimation, and permutation tests. DOS, Windows, UNIX. Stata Corp, 702 University Drive East, College Station, TX 77840; 800/782-8272; email: [email protected].

Statistical Calculator (SC). DOS, UNIX, T800 transputer. An extensible statistical environment, supplied with more than 1200 built-in (compiled C) and external (written in SC's C-like language) routines. Permutation-based methods for contingency tables (chi-square, likelihood, Kendall S, Theil U, K, T), one- and two-sample inference for means, correlation, and multivariate analysis (MV runs test, Boyett/Schuster, Hotelling's T2). Ready-made bootstrap routines for testing homoscedasticity, detecting multimodality, plus general bootstrapping and jackknifing facilities. Dr. Tony Dusoir, Mole Software, 23 Cable Road, Whitehead, Co. Antrim BT38 9PZ, N. Ireland; (+44) (0)960 378988; email: [email protected].

Testimate lacks the pull-down menus and prompts that Windows users have grown to rely on. Too few statistical routines for its price tag. (DOS only.)
idv: Datenanalyse und Versuchsplanung, Wessobrunner Straße 6, D-82131 Gauting, Munich, Germany; 89.850.8001; also, SciTech International, Inc., 2231 North Clybourn Avenue, Chicago, IL 60614-3011; 800/622-3345.

NoName also lacks the pull-down menus and prompts that Windows users have grown to rely on. But it can be used to analyze all the permutation tests and many of the simpler bootstraps described in this text. Best of all for the nonstatistician, it tries to guide you to the correct test. Shareware. May be downloaded from http://users.oco.net/drphilgood.

[I] Aalen 00. Nonparametric inference for a family of counting processes. Ann. Statist. 1978; 6: 701-26. [2] Abel U; Berger J. Comparison ofresubstitution, data splitting, the bootstrap and the jacknife as methods for estimating validity indices of new marker tests: A Monte Carlo study. Biometric J. 1986; 28: 899-908. [3] Abramson IS. Arbitrariness of the pilot estimate in adaptive kernel methods. J Multiv Anal. 1982; 12: 562-7. [4] Abramson IS. On bandwidth variation in kernel estimates-a square root law. Ann. Statist. 1982; 10: 1217-29. [5] Adams DC; Anthony CD. Using randomization techniques to analyse behavioural data. Animal Behaviour. 1996; 51: 733--738. [6] Agresti A. Categorical . New York: John Wiley & Sons; 1990. [7] Agresti A. A survey of exact inference for contingency tables. Stat. Sci. 1992; 7: 131-177. [8] Ahmad IA; Kochar SC. Testing for dispersive ordering. Statist Probab Ltr. 1989; 8: 179-185. [9] Ahrens LH. Observations of the Fe-Si-Mg relationship in chondrites. Geochimica et Cosmochimica Acta. 1965; 29: 801-6. [10] Aitchison J; Lauder IJ. Kernel density estimation compositional data. Appl. Stat. 1985; 34: 129-37. [II] Akritas MG. Bootstrapping the Kaplan-Meier estimator. JASA. 1986; 81: 1032-8. [12] Albers W; Bickel PJ & Van Zwet WR. Asymptotic expansions for the power of distribution-free tests in the one-sample problem. Ann. Statist. 1976; 4: 108-156. [13] Alderson MR; Nayak RA. Study of space-time clustering in Hodgkin's disease in the Manchester Region. British J. Preventive and Social Medicine. 1971 ; 25: 168-73. [14] Altman DG; Andersen PK. Bootstrap investigation of the stability of the Cox regression model. Stat. Med. 1989; 8: 771-83. [15] Aly E-E AA. Simple tests for dispersive ordering. Stat. Prob. Ltr. 1990; 9: 323-5. 248 Bibliographical References

[16] Andersen PK; Borgan 0; Gill R & Keiding N. Linear nonparametric tests for com• parison of counting processes with applications to censored survival data. Int. Stat. Rev. 1982; 50: 219-58. [17] Andersen PK; Borgan 0; Gill RD & Keiding N. Statistical Models Based on Counting Processes. New York: Springer Verlag; 1993. [18] Arndt S; Cizadlo T; Andreasen NC; Heckel D; Gold S; & Oleary DS. Tests for comparing images based on randomization and permutation methods. 1. Cerebral Blood Flow and Metabolism. 1996; 16: 1271-9. [19] Baglivo J; Olivier D & Pagano M. Methods for the analysis of contingency tables with large and small cell counts. JASA. 1988; 83: 1006-13. [20] Baglivo J; Olivier D & Pagano M. Methods for exact goodness-of-fit tests. JASA. 1992; 87: 464--469. [21] Baker RD. Two permutation tests of equality of variance. Stat. & Comput. 1995; 5(4): 289-96. [22] Barbe P; Bertail P. The Weighted Bootstrap. New York: Springer-Verlag; 1995. [23] Barbella P; Denby L & Glandwehr JM. Beyond exploratory data analysis: The randomization test. Math Teacher. 1990; 83: 144-49. [24] Barton DE; David FN. Randomization basis for multivariate tests. Bull. Int. Statist. Inst. 1961; 39(2): 455-67. [25] Basu D. Discussion of Joseph Berkson's paper "In dispraise of the ". J. Statist. Plan. Infer. 1979; 3: 189-92. [26] Berger RL; Boos DD. P values maximized over a confidence set for a nuisance parameter. JASA. 1994; 89: 1012-16. [27] Berger JO; Wolpert RW. The Likelihood Principle. IMS Lecture Notes--Monograph Series. Hayward CA:IMS; 1984. [28] Berkson J. In dispraise of the exact test. 1. Statist. Plan. In! 1978; 2: 27-42. [29] Berry KJ; Kvamme KL & Mielke PWjr. Permutation techniques for the ofthe distribution of artifacts into classes. Amer. Antiquity. 1980; 45: 55-59. [30] Berry KJ; Kvamme KL & Mielke PWjr. Improvements in the permutation test for the spatial analysis of the distribution of artifacts into classes. Amer. Antiquity. 1983; 48: 547-553. [31] Berry KJ; Mielke PW Jr. 
A family of multivariate measures of association for nominal independent variables. Educ.& Psych. Meas. 1992; 52: 97-101. [32] Berry KJ; Mileke PW Jr. A measure of association fornominal independent variables. Educ.& Psych. Meas. 1992; 52: 895-8. [33] Besag JE. Some methods of statistical analysis for spatial data. Bull. Int. Statist. Inst. 1978; 47: 77-92. [34] Bickel PM; Van Zwet WR. Asymptotic expansion for the power of distribution free tests in the two-sample problem. Ann. Statist. 1978; 6: 987-1004 (corr I 170-71). [35] Bickel PJ; Freedman DA. Some asymptotic theory for the bootstrap. Ann. Statist. 1981; 9: 1196-1217. [36] Bishop YMM; Fienberg SE & Holland Pw. Discrete Multivariate Analysis: Theory and Practice. Cambridge MA: MIT Press; 1975. [37] Blair C; Higgins JJ; Karinsky W; Krom R; & Rey JD. A study of multivariate per• mutation tests which may replace Hotellings T2 test in prescribed circumstances. Multivariate Beh.Res. 1994; 29: 141-63. [38] Blair RC; Troendle JF; & Beck, R. W. Control of familywise errors in multiple endpoint assessments via stepwise permutation tests. Statistics in Medicine. 1996; 15: 1107-1121 Bibliographical References 249

[39] Boess FG; Balasuvramanian R, Brammer MJ; & Campbell IC. Stimulation of mus• carinic acetylcholine receptors increases synaptosomal free calcium concentration by protein kinase-dependent opening of L-type calcium channels. J. Neurochem. 1990; 55: 230-6. [40] Boyett JM; Shuster JJ. Nonparametric one-sided tests in multivariate analysis with medical applications. JASA. 1977; 72: 665-668. [41] Bradbury IS. Analysis of variance vs randomization tests: a comparison (with discussion by White and Still). Brit. 1. Math. Stat. Psych. 1987; 40: 177-195. [42] Bradley Jv. Distribution Free Statistical Tests. New Jersey: Prentice Hall; 1968. [43] Breiman L. The little bootstrap and other methods for density selection in regression: X-fixed prediction error. JASA. 1992; 87: 738-54. [44] Breiman L; Friedman JH, Olshen RA; & Stone CJ. Classification and Regression Trees. Belmont CA: Wadsworth; 1984. [45] Bross, /OJ. Taking a covariable into account. JASA. 1964; 59: 725--36. [46] Bryant EH. Morphometric adaptation of the housefly, Musa domestica L; in the United States. Evolution. 1977; 31: 580-96. [47] Buckland ST; Garthwaite PH. Quantifying precision of mark-recapture estimates using the bootstrap and related methods. Biometrics. 1991; 47: 225-68. [48] Bullmore E; Brammer M; Williams SC; Rabehesk S; Janot N; David A; Mekers J; Howard R & Slam P. Statistical methods for estimation and inference for functional MR image analysis. Magn. Res M. 1996; 35(2): 261-77. [49] Buonaccorsi JP. A note on confidence intervals for proportions in finite populations. Amer. Stat. 1987; 41 : 215--8. [50] Busby DG. Effects of aerial spraying of fenithrothion on breeding white-throated sparrows. J. Appl. Ecol. 1990; 27: 745--55. [51] Cade B. Comparison of tree basal area and canopy cover in habitat models: Subalpine forest. 1. Alpine Mgmt. 1997; 61: 326-35. [52] Cade B; Hoffman H. Differential migration of blue grouse in Colorado. Auk. 1993; 110:70-7. [53] Cade B; Richards L. 
Permutation tests for least absolute deviation regression. Biometrics. 1996; 52: 886-902. [54] Carter C; Catlett J. Assessing credit card applications using machine learning. IEEE Expert. 1987; 2: 71-79. [55] Chhikara RK. State of the art in credit evaluation. Amer. 1. Agric. Econ. 1989; 71: 1138-44. [56] Cleveland WS. The Elements ofGraphing Data. Monterey CA: Wadsworth; 1985. [57] Cleveland WS. Visualizing Data. Summit NJ: Hobart Press; 1993. [58] Cliff AD; Ord JK. Evaluating the percentage points of a spatial autocorrelation coefficient. Geog. Anal. 1971; 3: 51-62 [59] Cliff AD; Ord JK. Spatial Processes: Models and Applications. London: Pion Ltd; 1981. [60] Cox OF; Kempthorne O. Randomization tests for comparing survival curves. Biometrics. 1963; 19: 307-17. [61] Cox DR. Partial likelihood. Biometrika. 1975; 62: 269-76. [62] Cox DR. Regression models and life tables (with discussion). 1. Roy. Statist. Soc B. 1972; 34: 187-220. [63] Cox DR; Snell EJ. Applied Statistics. Principles and Examples. London: Chapman and Hall; 1981. [64] Cover T; Hart P. Nearest neighbor pattern classification. IEEE Trans. on Info. Theory. 1967; IT-13: 21-27. 250 Bibliographical References

[65] Dalen J. Computing elementary aggregates in the Swedish Consumer Price Index. J Official Statist. 1992; 8: 129-147. [66] Dansie BR. A note on permutation probabilities. 1. Roy. Statist. Soc. B. 1983; 45: 22-24. [67] Davis AW. On the effects of moderate nonnormality on Roy's largest root test. JASA. 1982; 77: 896-900. [68] Dejager OC; Swanepoel JWH; & Raubenheimer BC. Kernel density estimators applied to gamma ray light curves. Astron. & Astrophy. 1986; 170: 187-96. [69] Devroye L; Oyorfi L. Nonparametric Density Estimation: The Ll View. New York: Wiley; 1985. [70] Diaconis P. Statistical problems in ESP research. Science. 1978; 20 I: 131-6. [71] Diaconis P; Efron B. Computer intensive methods in statistics. Sci. Amer. 1983; 48: 116-130. [72] DiCiccio TJ; Hall P; & Romano JP. On smoothing the bootstrap. Ann. Statist. 1989; 17: 692-704. [73] DiCiccio TJ; Romano J. A review of bootstrap confidence intervals (with discussions). 1. Roy. Statist. Soc. B. 1988; 50: 163-70. [74] Dietz EJ. Permutation tests for the association between two distance matricies. Systemic Zoology. 1983; 32: 21-26. [75] Diggle PJ; Lange N; & Benes FM. Analysis of variance for replicated spatial point paterns in Clinical Neuroanatomy. JASA. 1991; 86: 618--25. [76] Do KA; Hall P. On importance resampling for the bootstrap. Biometrika. 1991; 78: 161-7. [77] Doolittle RF. Similar amino acid sequences: chance or common ancestory. Science. 1981; 214: 149-159. [78] Douglas ME; Endler JA. Quantitative matrix comparisons in ecological and evolutionary investigations. 1. theoret. Bioi. 1982; 99: 777-95 [79] Draper D; Hodges JS; Mallows CL; & Pregibon D. Exchangeability and data analysis (with discussion). 1. Roy. Statist. Soc. A. 1993; 156: 9-28. [80] Dubuisson B; Lavison P. Surveillance of a nuclear reactor by use of a pattern recognition methodology. IEEE Trans. Systems. Man. & Cybernetics. 1980; 10: 603-9. [81] Ducharme OR; Jhun M; Romano JP; & Truong KN. 
Bootstrap confidence cones for directional data. Biometrika. 1985; 72: 637-45. [82] Duffy DE; Quinoz AJ. Permutation-based algorithm for block clustering. J Classification. 1991; 8: 65-91. [83] Duffy DE; Fowlkes EB; & Kane LD. in strategic data architecture design. Bellcore Database Symposium; 1987; Morristown NJ. Morristown NJ: Bell Communications Research, Inc.; 1987: 175-186. [84] Dwass M. Modified randomization tests for non-parametric hypotheses. Ann. Math. Statist. 1957; 28: 181-7. [85] Edgington ES. Randomization Tests. 3rd ed. New York: Marcel Dekker; 1995. [86] Edgington ES. Validity of randomization tests for one-subject . J Educ Stat. 1980; 5: 235-51. [87] Efron B. Bootstrap methods: Another look at the jackknife. Annals Statist. 1979; 7: 1-26. [88] Efron B. Censored data and the bootstrap. JASA. 1981; 76: 312-319. [89] Efron, B. The Jackknife. the Bootstrap and Other Resampling Plans. Philadelphia: SIAM; 1982. Bibliographical References 251

[90] Efron B. Estimating the error rate of a prediction rule: improvements on cross• validation.JASA. 1983; 78: 316-31. [91] Efron B. Bootstrap confidence intervals: good or bad? (with discussion). Psychol. Bull. 1988; 104: 293-6. [92] Efron B. Six questions raised by the bootstrap. R. LePage and L. Billard, editors. Exploring the Limits ofthe Bootstrap. New York: Wiley; 1992. [93] Efron B. , imputation, and the bootstrap (with discussion). JASA. 1994; 89: 463-79. [94] Efron B; DiCiccio T. More accurate confidence intervals in exponential families. Biometrika. 1992; 79: 231-45. [95] Efron B; Tibshirani R. Bootstrap measures for standard errors, confidence intervals, and other measures of statistical accuracy. Stat Sci. 1986; I: 54-77. [96] Efron B; Tibshriani R. Statistical data analysis in the computer age. Science. 1991; 253: 390-5. [97] Efron B; Tibshirani R. An Introduction to the Bootstrap. New York: Chapman and Hall; 1993. [98] Entsuah AR. Randomization procedures for analyzing clinical trend data with treatment related withdrawls. Commun. Statist. A. 1990; 19: 3859-80. [99] Falk M; Reiss RD. Weak convergence of smoothed and nonsmoothed bootstrap quantiles estimates. Ann. Prob. 1989; 17: 362-71. [100] Faris PD; Sainsbury RS. The role of the Pontis Oralis in the generation of RSA activity in the hippocampus of the guinea pig. Psych.& Beh. 1990; 47: 1193-9. [10 I] Farrar DA; Crump KS. Exact statistical tests for any cocarcinogenic effect in animal assays. Fund. App!. Toxico!. 1988; II: 652-63. [102] Farrar DA; Crump KS. Exact statistical tests for any cocarcinogenic effect in animal assays. II age adjusted tests. Fund. Appl. Toxicol. 1991; 15: 710-21. [103] Fears TR; Tarone RE; & Chu KC. False-positive and false-negative rates for carcinogenicity screens. Cancer Res. 1977; 37: 1941-5. [104] Feinstein AR. Clinical XXIII. The role of randomization in sampling, testing, allocation, and credulous idolatry (part 2). Clinical Pharm. 1973; 14: 989- 1019. 
[105] Festinger LC; Carlsmith JM. Cognitive consequences of forced compliance. 1. Abnorm. Soc. Psych. 1959; 58: 203-10. [106] Fisher RA. Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd; 1st ed 1925. [107] Fisher RA. The logic of inductive inference (with discussion). J. Roy. Statist. Soc. A. 1934; 98: 39-54. [108] Fisher RA. . New York: Hafner; 1935. [109] Fisher RA. Coefficient of racial likeness and the future of craniometry. J Royal Anthrop Soc. 1936; 66: 57-63. [110] Fisher RA. The use of multiple measurements in taxonomic problems. Ann. Eugenics. 1936; 7: 179-88. [III] Fix E; Hodges JL Jr; & Lehmann EL. The restricted chi-square test. Studies in Probability and Statistics Dedicated to Harold Cramer. Stockholm: Almquist and Wiksell; 1959. [112] Fleming TR; Harrington DP. Counting Processes and . New York: John Wiley & Sons; 1991. [113] Ford RD; Colom LV; & Bland BH. The classification of medial septum-diagonal band cells as theta-on or theta-off in relation to hippo campal EEG states. Brain Res. 1989;493: 269-282. 252 Bibliographical References

[114] Forster JJ; McDonald, JW; & Smith, PWF. Monte Carlo exact conditional tests for log-linear and logistic models. 1. Roy. Statist. Soc. B. 1966; 58: 445-453. [115] Forsythe AB; Engleman L; Jennrich R. A stopping rule for variable selection in multivariate regression. JASA. 1973; 68: 75-7. [116] Forsythe AB; Frey HS. Tests of significance from survival data. Compo & Biomed. Res. 1970; 3: 124-132. [117] Foutz RN; Jensen DR; & Anderson Gw. Multiple comparisons in the randomization analysis of designed experiments with growth curve responses. Biometrics. 1985; 41: 29-37. [118] Frank, D; Trzos RJ; & Good P. Evaluating drug-induced chromosome alterations. Mutation Res. 1978; 56: 311-7. [119] Fraumeni JF; Li FP. Hodgkin's disease in childhood: an epidemiological study. 1. Nat. Cancer Inst. 1969; 42: 681-91. [120] Freedman D; Diaconis P. On the as a density estimator: L2 theory. Zeitschriftfur Wahrscheinlichkeitstheorie undverwandte Gebeite. 1981; 57: 453-76. [121] Freedman L. Using permutation tests and bootstrap confidence limits to analyze repeated events data. Contr. Clin. Trials. 1989; 10: 129-141. [122] Freeman GH; Halton JH. Note on an exact treatment of contingency, goodness of fit, and other problems of significance. Biometrika. 1951; 38: 141-149. [123] Friedman JH. Exploratory projection pursuit. JASA. 1987; 82: 249-66. [124] Fryer MJ. A review of some nonparametric methods of density estimation. 1. Instr. Math. App/ic. 1976; 20: 335-54. [125] Fukunaga K. Statistical Pattern Recognition (2nd ed). London: Academic Press; 1990. [126] Gabriel KR. Some statistical issues in weather experimentation. Commun Statist. A. 1979; 8: 975-1015. [127] Gabriel KR; Hall, WJ. Rerandomization inference on regression and shift effects: Computationally feasible methods. JASA. 1983; 78: 827-36. [128] Gabriel KR; Hsu CF. Evaluation of the power of rerandomization tests, with application to weather modification experiments. JASA. 1983; 78: 76fr75. [129] Gabriel KR; Sokal RR. 
A new statistical approach to geographical variation analysis. Systematic Zoology. 1969; 18: 259-70. [130] Gail M; Mantel N. Counting the number of rxc contingency tables with fixed marginals. JASA. 1977; 72: 859-62. [131] Gail MH; Tan WY; & Piantadosi S. Tests for no treatment effect in randomized clinical trials. Biometrika. 1988; 75: 57-64. [132] Gart JJ. Statistical Methods in Cancer Res., Vol III - The design and analysis oflong term animal experiments. Lyon: IARC Scientific Publications; 1986. [133] Garthwaite PH. Confidence intervals from randomization tests. Biometrics. 1996; 52: 1387-93. [134] Gastwirht JL. Statistical reasoning in the legal setting. A me!: Statist. 1992; 46: 55-69. [135] Geisser S. The predictive sample reuse method with applications. JASA. 1975; 70: 320-8. [136] Gine E; Zinn J. Necessary conditions for a bootstrap of the mean. Ann. Statist. 1989; 17:684-91. [137] Glass AG; Mantel N. Lack of time-space clustering of childhood leukemia, Los Angeles County 1960-64. Cancer Res. 1969; 29: 1995-200 I. [138] Glass AG; Mantel N; Gunz FW; & Spears GFS. Time-space clustering of childhood leukemia in New Zealand. 1. Nat. Cancer Inst. 1971; 47: 329-36. Bibliographical References 253

[139] Gleason JR. Algorithms for balanced bootstrap simulations. Amer. Statist. 1988; 42: 263-66.
[140] Glidden-Tracey C; Greenwood AK. A validation study of the Spanish self directed search using back translation procedures. J. Career Assess. 1997; 5: 105-13.
[141] Glidden-Tracey CE; Parraga MI. Assessing the structure of vocational interests among Bolivian university students. J. Vocational Beh. 1996; 48: 96-106.
[142] Goldberg P; Leffert F; Gonzales M; Gorgenola J; & Zerbe GO. Intravenous aminophylline in asthma: A comparison of two methods of administration in children. Amer. J. of Diseases of Children. 1980; 134: 12-18.
[143] Good IJ; Gaskins RA. Density estimation and bump-hunting by the penalized likelihood method exemplified by scattering and meteorite data (with discussion). JASA. 1980; 75: 42-73.
[144] Good PI. Detection of a treatment effect when not all experimental subjects respond to treatment. Biometrics. 1979; 35(2): 483-9.
[145] Good PI. Almost most powerful tests for composite alternatives. Comm. Statist.-Theory & Methods. 1989; 18(5): 1913-25.
[146] Good PI. Most powerful tests for use in matched pair experiments when data may be censored. J. Statist. Comp. Simul. 1991; 38: 57-63.
[147] Good PI. Globally almost powerful tests for censored data. J. Nonparametric Statist. 1992; 1: 253-62.
[148] Good PI. Permutation Tests. New York: Springer Verlag; 1994.
[149] Goodman L; Kruskal W. Measures of association for cross-classification. JASA. 1954; 49: 732-64.
[150] Gordon L; Olshen RA. Tree-structured survival analysis (with discussion). Cancer Treatment Reports. 1985; 69: 1065-8.
[151] Graubard BI; Korn EL. Choice of column scores for testing independence in ordered 2 by K contingency tables. Biometrics. 1987; 43: 471-6.
[152] Graves GW; Whinston AB. An algorithm for the quadratic assignment problem. Mgmt. Science. 1970; 17: 453-71.
[153] Grubbs G. Fiducial bounds on reliability for the two-parameter negative exponential distribution. Technometrics. 1971; 13: 873-6.
[154] Gupta et al. Community Dentistry and Oral Epidemiology. 1980; 8: 287-333.
[155] Haber M. A comparison of some conditional and unconditional exact tests for 2x2 contingency tables. Comm. Statist. A. 1987; 18: 147-56.
[156] Hall P. On the bootstrap and confidence intervals. Ann. Statist. 1986; 14: 1431-52.
[157] Hall P. Theoretical comparison of bootstrap confidence intervals (with discussion). Ann. Statist. 1988; 16: 927-85.
[158] Hall P; Martin MA. On bootstrap resampling and iteration. Biometrika. 1988; 75: 661-71.
[159] Hall P; Titterington M. The effect of simulation order on level accuracy and power of Monte Carlo tests. J. Roy. Statist. Soc. B. 1989; 51: 459-67.
[160] Hall P; Wilson SR. Two guidelines for bootstrap hypothesis testing. Biometrics. 1991; 47: 757-62.
[161] Hartigan JA. Using subsample values as typical values. JASA. 1969; 64: 1303-17.
[162] Hartigan JA. Error analysis by replaced samples. J. Roy. Statist. Soc. B. 1971; 33: 98-110.
[163] Hartigan JA. Direct clustering of a data matrix. JASA. 1972; 67: 123-9.
[164] Hartigan JA. Clustering Algorithms. New York: John Wiley & Sons; 1975.
[165] Härdle W. Smoothing Techniques with Implementation in S. New York: Springer-Verlag; 1991.

[166] Härdle W; Bowman A. Bootstrapping in nonparametric regression: local adaptive smoothing and confidence bands. JASA. 1988; 83: 102-10.
[167] Hardy FL; Youse BK. Finite Mathematics for the Managerial, Social, and Life Sciences. New York: West Publishing; 1984.
[168] Hasegawa M; Kishino H; & Yano T. Phylogenetic inference from DNA sequence data. In K. Matusita, editor. Statistical Theory and Data Analysis. Amsterdam: North Holland; 1988.
[169] Henery RJ. Permutation probabilities for gamma random variables. J. Appl. Probab. 1983; 20(4): 822-34.
[170] Henze N. A multivariate two-sample test based on the number of nearest neighbor coincidences. Annals Statist. 1988; 16: 772-83.
[171] Hettmansperger TP. Statistical Inference Based on Ranks. New York: Wiley; 1984.
[172] Higgins JJ; Noble W. A permutation test for a repeated measures design. Applied Statistics in Agriculture. 1993; 5: 240-55.
[173] Highton R. Comparison of microgeographic variation in morphological and electrophoretic traits. In Hecht MK, Steer WC, & B Wallace, eds. Evolutionary Biology. New York: Plenum; 1977; 10: 397-436.
[174] Hill AB. Principles of Medical Statistics (8th ed). London: Oxford University Press; 1966.
[175] Hinkley DV; Shi S. Importance sampling and the nested bootstrap. Biometrika. 1989; 76: 435-46.
[176] Hirji KF; Mehta CR; & Patel NR. Computing distributions for exact logistic regression. JASA. 1987; 82: 1110-7.
[177] Hjorth JSU. Computer Intensive Statistical Methods: Validation, Model Selection and Bootstrap. New York: Chapman and Hall; 1994.
[178] Hodges JL; Lehmann EL. Estimates of location based on rank tests. Ann. Math. Statist. 1963; 34: 598-611.
[179] Hoeffding W. The large-sample power of tests based on permutations of observations. Ann. Math. Statist. 1952; 23: 169-92.
[180] Hoel DG; Walburg HE. Statistical analysis of survival experiments. J. Nat. Cancer Inst. 1972; 49: 361-72.
[181] Hollander M; Pena E. Nonparametric tests under restricted treatment assignment rules. JASA. 1988; 83(404): 1144-51.
[182] Hollander M; Sethuraman J. Testing for agreement between two groups of judges. Biometrika. 1978; 65: 403-12.
[183] Holmes MC; Williams REO. The distribution of carriers of streptococcus pyrogenes among 2413 healthy children. J. Hyg. Camb. 1954; 52: 165-79.
[184] Hosmer DW; Lemeshow S. Best subsets logistic regression. Biometrics. 1989.
[185] Hope ACA. A modified Monte Carlo significance test procedure. J. Roy. Statist. Soc. B. 1968; 30: 582-98.
[186] Howard M (pseud. for Good P). Randomization in the analysis of experiments and clinical trials. American Laboratory. 1981; 13: 98-102.
[187] Hubert LJ. Combinatorial data analysis: Association and partial association. Psychometrika. 1985; 50: 449-67.
[188] Hubert LJ; Baker FB. Analyzing distinctive features confusion matrix. J. Educ. Statist. 1977; 2: 79-98.
[189] Hubert LJ; Baker FB. Evaluating the conformity of sociometric measurements. Psychometrika. 1978; 43: 31-42.
[190] Hubert LJ; Golledge RG; & Costanzo CM. Analysis of variance procedures based on a proximity measure between subjects. Psych. Bull. 1982; 91: 424-30.

[191] Hubert LJ; Golledge RG; Costanzo CM; Gale N; & Halperin WC. Nonparametric tests for directional data. In Bahrenberg G, Fischer M, & P. Nijkamp, eds. Recent developments in spatial analysis: Methodology, Measurement, Models. Aldershot UK: Gower; 1984: 171-90.
[192] Hubert LJ; Schultz J. Quadratic assignment as a general data analysis strategy. Brit. J. Math. Stat. Psych. 1976; 29: 190-241.
[193] Ingenbleek JF. Tests simultanes de permutation des rangs pour bruit-blanc multivarie. Statist. Anal. Donnees. 1981; 6: 60-5.
[194] Izenman AJ. Recent developments in nonparametric density estimation. JASA. 1991; 86(413): 205-224.
[195] Izenman AJ; Sommer CJ. Philatelic mixtures and multimodal densities. JASA. 1988; 83: 941-53.
[196] Jackson DA. Ratios in aquatic sciences: Statistical shortcomings with mean depth and the morphoedaphic index. Canadian J. Fisheries and Aquatic Sciences. 1990; 47: 1788-95.
[197] Janssen A. Conditional rank tests for randomly censored data. Annal. Statist. 1991; 19: 1434-56.
[198] Jennrich RI. A note on the behaviour of the log rank permutation test under unequal censoring. Biometrika. 1983; 70: 133-7.
[199] Jennrich RI. Some exact tests for comparing survival curves in the presence of unequal right censoring. Biometrika. 1984; 71: 57-64.
[200] Jin MZ. On the multisample permutation test when the experimental units are nonuniform and random experimental errors exist. J. System Sci. Math. Sci. 1984; 4: 117-27, 236-43.
[201] Jockel KH. Finite sample properties and asymptotic efficiency of Monte Carlo tests. Annals Statistics. 1986; 14: 336-347.
[202] John RD; Robinson J. Significance levels and confidence intervals for randomization tests. J. Statist. Comput. Simul. 1983; 16: 161-73.
[203] Johns MV jr. Importance sampling for bootstrap confidence intervals. JASA. 1988; 83: 709-14.
[204] Johnstone IM; Silverman BW. Speed of estimation in positron emission tomography and related studies. Annal. Statist. 1990; 18: 251-80.
[205] Jones HL. Investigating the properties of a sample mean by employing random subsample means. JASA. 1956; 51: 54-83.
[206] Jorde LB; Rogers AR; Bamshad M; Watkins WS; Krakowiak P; Sung S; Kere J; & Harpending HC. Microsatellite diversity and the demographic history of modern humans. Proc. Nat. Acad. Sci. 1997; 94: 3100-3.
[207] Kalbfleisch JD. Likelihood methods and nonparametric tests. JASA. 1978; 73: 167-170.
[208] Kalbfleisch JD; Prentice RL. The Statistical Analysis of Failure Time Data. New York: John Wiley & Sons; 1980.
[209] Kaplan EL; Meier P. Non-parametric estimation from incomplete observations. JASA. 1958; 53: 457-81, 562-3.
[210] Karlin S; Ghandour G; Ost F; Tavare S; & Korn LJ. New approaches for computer analysis of DNA sequences. Proc. Nat. Acad. Sci., USA. 1983; 80: 5660-4.
[211] Karlin S; Williams PT. Permutation methods for the structured exploratory data analysis (SEDA) of familial trait values. Amer. J. Human Genetics. 1984; 36: 873-98.
[212] Kazdin AE. Statistical analysis for single-case experimental designs. In Strategies for Studying Behavioral Change. M Hersen and DH Barlow, editors. New York: Pergamon; 1976.

[213] Kazdin AE. Obstacles in using randomization tests in single-case experiments. J. Educ. Statist. 1980; 5: 253-60.
[214] Keller-McNulty S; Higgins JJ. Effect of tail weight and outliers on power and type I error of robust permutation tests for location. Commun. Stat.-Theory and Methods. 1987; 16: 17-35.
[215] Kempthorne O. The randomization theory of experimental inference. JASA. 1955; 50: 946-67.
[216] Kempthorne O. Some aspects of experimental inference. JASA. 1966; 61: 11-34.
[217] Kempthorne O. Inference from experiments and randomization. In A Survey of Statistical Design and Linear Models. JN Srivastava, ed. Amsterdam: North Holland; 1975: 303-32.
[218] Kempthorne O. Why randomize? J. Statist. Plan. Infer. 1977; 1: 1-26.
[219] Kempthorne O. In dispraise of the exact test: reactions. J. Statist. Plan. Infer. 1979; 3: 199-213.
[220] Kempthorne O; Doerfler TE. The behavior of some significance tests under experimental randomization. Biometrika. 1969; 56: 231-48.
[221] Klauber MR. Two-sample randomization tests for space-time clustering. Biometrics. 1971; 27: 129-42.
[222] Klauber MR; Mustacchi A. Space-time clustering of childhood leukemia in San Francisco. Cancer Res. 1970; 30: 1969-73.
[223] Kleinbaum DB; Kupper LL; & Chambless LE. Logistic regression analysis of epidemiologic data: theory and practice. Commun. Statist. A. 1982; 11: 485-547.
[224] Knight K. On the bootstrap of the sample mean in the infinite variance case. Annal. Statist. 1989; 17: 1168-73.
[225] Koch G (ed). Exchangeability in Probability and Statistics. Amsterdam: North Holland; 1982.
[226] Koziol JA; Maxwell DA; Fukushima M; Colmer A; & Pilch YH. A distribution-free test for tumor-growth curve analyses with applications to an animal tumor immunotherapy experiment. Biometrics. 1981; 37: 383-90.
[227] Krehbiel K. Are Congressional committees composed of preference outliers? Amer. Polit. Sci. Rev. 1990; 84: 149-63.
[228] Krewski D; Brennan J; & M Bickis. The power of the Fisher permutation test in 2 by k tables. Commun. Stat. B. 1984; 13: 433-48.
[229] Kryscio RJ; Meyers MH; Prusiner SI; Heise HW; & BW Christine. The space-time distribution of Hodgkin's disease in Connecticut, 1940-1969. J. Nat. Cancer Inst. 1973; 50: 1107-10.
[230] Lachin JM. Properties of sample randomization in clinical trials. Contr. Clin. Trials. 1988; 9: 312-26.
[231] Lachin JM. Statistical properties of randomization in clinical trials. Contr. Clin. Trials. 1988; 9: 289-311.
[232] Lahiri SN. Bootstrapping the studentized sample mean of lattice variables. J. Multiv. Anal. 1993; 45: 247-56.
[233] Lambert D. Robust two-sample permutation tests. Ann. Statist. 1985; 13: 606-25.
[234] Leemis LM. Relationships among common univariate distributions. Amer. Statistician. 1986; 40: 143-6.
[235] Lefebvre M. Une application des methodes sequentielles aux tests de permutations. Canad. J. Statist. 1982; 10: 173-80.
[236] Lehmann EL. Testing Statistical Hypotheses. New York: John Wiley & Sons; 1986.
[237] Leonard T. Density estimation, stochastic processes, and prior information (with discussion). J. Roy. Statist. Soc. B. 1978; 40: 113-46.

[238] Levin DA. The organization of genetic variability in Phlox drummondii. Evolution. 1977; 31: 477-94.
[239] Liu RY. Bootstrap procedures under some non-iid models. Ann. Statist. 1988; 16: 1696-1708.
[240] Livezey RE. Statistical analysis of general circulation model climate simulation: Sensitivity and prediction experiments. J. Atmospheric Sciences. 1985; 42: 1139-49.
[241] Livezey RE; Chen W. Statistical field significance and its determination by Monte Carlo techniques. Monthly Weather Review. 1983; 111: 46-59.
[242] Loh WY. Estimating an endpoint of a distribution with resampling methods. Ann. Statist. 1984; 12: 1543-50.
[243] Loh WY. Bootstrap calibration for confidence interval construction and selection. Statist. Sinica. 1991; 1: 479-95.
[244] Loughin TM; Noble W. A permutation test for effects in an unreplicated factorial design. Technometrics. 1997; 39: 180-90.
[245] Lunneborg CE. Estimating the correlation coefficient: The bootstrap approach. Psychol. Bull. 1985; 98: 209-15.
[246] Mackay DA; Jones RE. Leaf-shape and the host-finding behavior of two ovipositing monophagous butterfly species. Ecol. Entom. 1989; 14: 423-31.
[247] Macuson R; Nordbrock E. A multivariate permutation test for the analysis of arbitrarily censored survival data. Biometrical J. 1981; 23: 461-5.
[248] Makinodan T; Albright JW; Peter CP; Good PI; & Hedrick ML. Reduced humoral activity in long-lived mice. Immunology. 1976; 31: 400-8.
[249] Makuch RW; Parks WP. Response of serum antigen level to AZT for the treatment of AIDS. AIDS Research and Human Retroviruses. 1988; 4: 305-16.
[250] Manly BFJ. The comparison and scaling of student assessment marks in several subjects. Applied Statistics. 1988; 37: 385-95.
[251] Manly BFJ. Randomization, Bootstrap and Monte Carlo Methods in Biology (2nd ed.). London: Chapman & Hall; 1997.
[252] Mann RC; Hand RE Jr. The randomization test applied to flow cytometric histograms. Computer Programs in Biomedicine. 1983; 17: 95-100.
[253] Mantel N. The detection of disease clustering and a generalized regression approach. Cancer Res. 1967; 27: 209-220.
[254] Mantel N; Bailar JC. A class of permutational and multinomial tests arising in epidemiological research. Biometrics. 1970; 26: 687-700.
[255] Mapleson WW. The use of GLIM and the bootstrap in assessing a clinical trial of two drugs. Statist. Med. 1986; 5: 363-74.
[256] Marcus LF. Measurement of selection using distance statistics in prehistoric orangutan Pongo pygmaeus palaeosumatrensis. Evolution. 1969; 23: 301.
[257] Mardia KV; Kent JT; & Bibby JM. Multivariate Analysis. New York: Academic Press; 1979.
[258] Maritz JS. Distribution Free Statistical Methods (2nd ed.). London: Chapman & Hall; 1996.
[259] Marriott FHC. Barnard's Monte Carlo tests: How many simulations? Appl. Statist. 1979; 28: 75-7.
[260] Marron JS. A comparison of cross-validation techniques in density estimation. Ann. Statist. 1987; 15: 152-62.
[261] Martin MA. On bootstrap iteration for coverage correction of confidence intervals. JASA. 1990; 85: 1105-8.
[262] Martin-Löf P. Exact tests, confidence regions and estimates. In Barndorff-Nielsen O; Blaesild P; & Schou G. Proceedings of the Conference on Foundational Questions in

Statistical Inference. Aarhus: Institute of Mathematics, University of Aarhus; 1974; 1: 121-38.
[263] Maxwell SE; Cole DA. A comparison of methods for increasing power in randomized between-subjects designs. Psych. Bull. 1991; 110: 328-37.
[264] May RB; Masson MEJ; & Hunter MA. Application of Statistics in Behavioral Research. New York: Harper & Row; 1990.
[265] McCarthy PJ. Pseudo-replication: Half samples. Review Int. Statist. Inst. 1969; 37: 239-64.
[266] McKinney PW; Young MJ; Hartz A; & Bi-Fong Lee M. The inexact use of Fisher's exact test in six major medical journals. J. American Medical Association. 1989; 261: 3430-3.
[267] Mehta CR. An interdisciplinary approach to exact inference for contingency tables. Statist. Sci. 1992; 7: 167-70.
[268] Mehta CR; Patel NR. A network algorithm for the exact treatment of the 2xK contingency table. Commun. Statist. B. 1980; 9: 649-64.
[269] Mehta CR; Patel NR. A network algorithm for performing Fisher's exact test in rxc contingency tables. JASA. 1983; 78: 427-34.
[270] Mehta CR; Patel NR; & Gray R. On computing an exact confidence interval for the common odds ratio in several 2x2 contingency tables. JASA. 1985; 80: 969-73.
[271] Mehta CR; Patel NR; & Senchaudhuri P. Importance sampling for estimating exact probabilities in permutational inference. JASA. 1988; 83: 999-1005.
[272] Melia KF; Ehlers CL. Signal detection analysis of ethanol effects on a complex conditional discrimination. Pharm. Biochem. Behavior. 1989; 33: 581-4.
[273] Merrington M; Spicer CC. Acute leukemia in New England. Brit. J. Preventive and Social Medicine. 1969; 23: 124-7.
[274] Micceri T. The unicorn, the normal curve, and other improbable creatures. Psychol. Bull. 1989; 105: 156-66.
[275] Mielke PW. Some parametric, nonparametric and permutation inference procedures resulting from weather modification experiments. Commun. Statist. A. 1979; 8: 1083-96.
[276] Mielke PW. Meteorological applications of permutation techniques based on distance functions. In Krishnaiah PR; Sen PK, editors. Handbook of Statistics. Amsterdam: North-Holland; 1984; 4: 813-30.
[277] Mielke PW Jr. Geometric concerns pertaining to applications of statistical tests in the atmospheric sciences. J. Atmospheric Sciences. 1985; 42: 1209-12.
[278] Mielke PW. Non-metric statistical analysis: Some metric alternatives. J. Statist. Plan. Infer. 1986; 13: 377-87.
[279] Mielke PW Jr. The application of multivariate permutation methods based on distance functions in the earth sciences. Earth-Science Rev. 1991; 31: 55-71.
[280] Mielke PW; Berry KJ. An extended class of permutation techniques for matched pairs. Commun. Statist.-Theory & Meth. 1982; 11: 1197-1207.
[281] Mielke PW Jr; Berry KJ. Fisher's exact probability test for cross-classification tables. Educational and Psychological Measurement. 1992; 52: 97-101.
[282] Miller AJ; Shaw DE; Veitch LG; & Smith EJ. Analyzing the results of a cloud-seeding experiment in Tasmania. Commun. Statist. A. 1979; 8: 1017-47.
[283] Mitchell-Olds T. Analysis of local variation in plant size. Ecology. 1987; 68: 82-7.
[284] Mitchell-Olds T. Quantitative genetics of survival and growth in Impatiens capensis. Evolution. 1986; 40: 107-16.
[285] Mooney CZ; Duval RD. Bootstrapping: A Nonparametric Approach to Statistical Inference. Newbury Park CA: Sage Publications; 1993.

[286] Mueller LD; Altenberg L. Statistical inference on measures of niche overlap. Ecology. 1985; 66: 1204-10.
[287] Mukhopadhyay I. Nonparametric tests for multiple regression under permutation symmetry. Calcutta Statist. Assoc. Bull. 1989; 38: 93-114.
[288] Nelson W. Applied Life Data Analysis. New York: John Wiley & Sons; 1982.
[289] Neyman J. First Course in Probability and Statistics. New York: Holt; 1950.
[290] Neyman J; Scott E. Field galaxies: luminosity, redshift, and abundance of types. Part I: Theory. Proceedings of the 4th Berkeley Symposium. 1960; 3: 261-76.
[291] Nguyen TT. A generalization of Fisher's exact test in pxq contingency tables using more concordant relations. Commun. Statist. B. 1985; 14: 633-45.
[292] Noether GE. Distribution-free confidence intervals. Statistica Neerlandica. 1978; 32: 104-122.
[293] Noreen E. Computer Intensive Methods for Testing Hypotheses. New York: John Wiley & Sons; 1989.
[294] Oden NL. Allocation of effort in Monte Carlo simulations for power of permutation tests. JASA. 1991; 86: 1074-76.
[295] Oja H. On permutation tests in multiple regression and analysis of covariance problems. Australian J. Statist. 1987; 29: 91-100.
[296] Park BU; Marron JS. Comparison of data-driven bandwidth selectors. JASA. 1990; 86: 66-72.
[297] Parzen E. On the estimation of a probability density function and the mode. Ann. Math. Statist. 1962; 33: 1065-76.
[298] Passing H. Exact simultaneous comparisons with controls in an rxc contingency table. Biometrical J. 1984; 26: 643-54.
[299] Patefield WM. Exact tests for trends in ordered contingency tables. Appl. Statist. 1982; 31: 32-43.
[300] Patil CHK. Cochran's Q test: exact distribution. JASA. 1975; 70: 186-9.
[301] Pearson ES. Some aspects of the problem of randomization. Biometrika. 1937; 29: 53-64.
[302] Peck R; Fisher L; & van Ness J. Bootstrap confidence intervals for the numbers of clusters in cluster analysis. JASA. 1989; 84: 184-91.
[303] Penninckx W; Hartmann C; Massart DL; & Smeyers-Verbeke J. Validation of the calibration procedure in atomic absorption spectrometric methods. J. Analytical Atomic Spectrometry. 1996; 11: 237-46.
[304] Peritz E. Exact tests for matched pairs: studies with covariates. Commun. Statist. A. 1982; 11: 2157-67 (errata 12: 1209-10).
[305] Peritz E. Modified Mantel-Haenszel procedures for matched pairs. Commun. Statist. A. 1985; 14: 2263-85.
[306] Petrondas DA; Gabriel RK. Multiple comparisons by rerandomization tests. JASA. 1983; 78: 949-57.
[307] Pitman EJG. Significance tests which may be applied to samples from any population. Roy. Statist. Soc. Suppl. 1937; 4: 119-30, 225-32.
[308] Pitman EJG. Significance tests which may be applied to samples from any population. Part III. The analysis of variance test. Biometrika. 1938; 29: 322-35.
[309] Plackett RL. Random permutations. J. Roy. Statist. Soc. B. 1968; 30: 517-534.
[310] Plackett RL. Analysis of permutations. Appl. Statist. 1975; 24: 163-71.
[311] Plackett RL; Hewlett PS. A unified theory of quantal responses to mixtures of drugs. The fitting to data of certain models for two non-interactive drugs with complete positive correlation of tolerances. Biometrics. 1963; 19: 517-31.

[312] Prager MH; Hoenig JM. Superposed epoch analysis: A randomization test of environmental effects on recruitment with application to chub mackerel. Trans. Amer. Fisheries Soc. 1989; 118: 608-19.
[313] Prakasa Rao BLS. Nonparametric Functional Estimation. New York: Academic Press; 1983.
[314] Preisendorfer RW; Barnett TP. Numerical model/reality intercomparison tests using small-sample statistics. J. Atmospheric Sciences. 1983; 40: 1884-96.
[315] Rasmussen J. Estimating correlation coefficients: bootstrap and parametric approaches. Psych. Bull. 1987; 101: 136-9.
[316] Raz J. Testing for no effect when estimating a smooth function by nonparametric regression: a randomization approach. JASA. 1990; 85: 132-8.
[317] Richards LE; Byrd J. AS 304: Fisher's randomization test for two small independent samples. Applied Statistics. 1996; 45: 394-8.
[318] Ripley BD. Statistical aspects of neural networks. In Barndorff-Nielsen OE, Jensen JL, and Kendall WS, editors. Networks and Chaos: Statistical and Probabilistic Aspects. London: Chapman and Hall; 1993: 40-123.
[319] Roeder K. Density estimation with confidence sets exemplified by superclusters and voids in galaxies. JASA. 1990; 85: 617-24.
[320] Romano JP. A bootstrap revival of some nonparametric distance tests. JASA. 1988; 83: 698-708.
[321] Romano JP. On the behavior of randomization tests without a group invariance assumption. JASA. 1990; 85(411): 686-92.
[322] Rosenbaum PR. Permutation tests for matched pairs with adjustments for covariates. Appl. Statist. 1988; 37(3): 401-11.
[323] Rosenblatt M. Remarks on some nonparametric estimators of a density function. Ann. Math. Statist. 1956; 27: 832-7.
[324] Royaltey HH; Astrachen E; & Sokal RR. Tests for patterns in geographic variation. Geographic Analysis. 1975; 7: 369-95.
[325] Runger GC; Eaton ML. Most powerful invariant tests. J. Multiv. Anal. 1992; 42: 202-09.
[326] Ryan JM; Tracey TJG; & Rounds J. Generalizability of Holland's structure of vocational interests across ethnicity, gender, and socioeconomic status. J. Counseling Psych. 1996; 43: 330-7.
[327] Ryman N; Reuterwall C; Nygren K; & Nygren T. Genetic variation and differentiation in Scandinavian moose (Alces alces): Are large mammals monomorphic? Evolution. 1980; 34: 1037-49.
[328] Saitoh T; Stenseth NC; & Bjornstad ON. Density dependence in fluctuating grey-sided vole populations. J. Animal Ecol. 1997; 66: 14-24.
[329] Salapatek P; Kessen W. Visual scanning of triangles by the human newborn. J. Exper. Child Psych. 1966; 3: 155-67.
[330] Scheffe H. Analysis of Variance. New York: John Wiley & Sons; 1959.
[331] Schemper M. A survey of permutation tests for censored survival data. Commun. Stat. A. 1984; 13: 433-48.
[332] Schenker N. Qualms about Bootstrap confidence intervals. JASA. 1985; 80: 360-1.
[333] Schultz JR; Hubert L. A nonparametric test for the correspondence between two proximity matrices. J. Educ. Statist. 1976; 1: 59-67.
[334] Scott DW. Multivariate Density Estimation: Theory, Practice, and Visualization. New York: John Wiley & Sons; 1992.

[335] Scott DW; Gotto AM; Cole JS; & Gorry JA. Plasma lipids as collateral risk factors in coronary artery disease: a study of 371 males with chest pain. J. Chronic Diseases. 1978; 31: 337-45.
[336] Scott DW; Thompson JR. Probability density estimation in higher dimensions. In Gentle JE, ed. Computer Science and Statistics: Proceedings of the Fifteenth Symposium on the Interface. Amsterdam: North Holland; 1983: 173-9.
[337] Selander RK; Kaufman DW. Genetic structure of populations of the brown snail (Helix aspersa). I. Microgeographic variation. Evolution. 1975; 29: 385-401.
[338] Shao J; Tu D. The Jackknife and the Bootstrap. New York: Springer; 1995.
[339] Shen CD; Quade D. A randomization test for a three-period three-treatment crossover experiment. Commun. Statist. B. 1986; 12: 183-99.
[340] Shuster JJ. Practical Handbook of Sample Size Guidelines for Clinical Trials. Boca Raton FL: CRC Press; 1993.
[341] Shuster JJ; Boyett JM. Nonparametric multiple comparison procedures. JASA. 1979; 74: 379-82.
[342] Siemiatycki J. Mantel's space-time clustering statistic: computing higher moments and a comparison of various data transforms. J. Statist. Comput. Simul. 1978; 7: 13-31.
[343] Siemiatycki J; McDonald AD. Neural tube defects in Quebec: A search for evidence of 'clustering' in time and space. Brit. J. Prev. Soc. Med. 1972; 26: 10-14.
[344] Silverman BW. Density ratios, empirical likelihood, and cot death. Appl. Statist. 1978; 27: 26-33.
[345] Silverman BW. Using kernel density estimates to investigate multimodality. J. Roy. Statist. Soc. B. 1981; 43: 97-99.
[346] Silverman BW. Density Estimation for Statistics and Data Analysis. London: Chapman and Hall; 1986.
[347] Silverman BW; Young GA. The bootstrap: to smooth or not to smooth. Biometrika. 1987; 74: 469-79.
[348] Simon JL. Basic Research Methods in Social Science. New York: Random House; 1969.
[349] Singh K. On the asymptotic accuracy of Efron's bootstrap. Ann. Statist. 1981; 9: 1187-95.
[350] Smith PWF; Forster JJ; & McDonald JW. Monte Carlo exact tests for square contingency tables. J. Roy. Statist. Soc. A. 1996; 159(2): 309-21.
[351] Smith PG; Pike MC. Generalization of two tests for the detection of household aggregation of disease. Biometrics. 1976; 32: 817-28.
[352] Smith PWF; McDonald JW; Forster JJ; & Berrington AM. Monte Carlo exact methods used for analysing inter-ethnic unions in Great Britain. Appl. Statist. 1996; 45: 191-202.
[353] Smythe RT. Conditional inference for designs. Ann. Math. Statist. 1988; 16(3): 1155-61.
[354] Sokal RR. Testing statistical significance of geographical variation patterns. Systematic Zoo. 1979; 28: 227-32.
[355] Solomon H. Confidence intervals in legal settings. In Statistics and The Law. DeGroot MH; Fienberg SE; & Kadane JB, eds. New York: John Wiley & Sons; 1986: 455-473.
[356] Solow AR. A randomization test for misclassification problems in discriminatory analysis. Ecology. 1990; 71: 2379-82.
[357] Soms AP. Permutation tests for k-sample binomial data with comparisons of exact and approximate P-levels. Commun. Statist. A. 1985; 14: 217-33.

[358] Stilson DW. Psychology and Statistics in Psychological Research and Theory. San Francisco: Holden Day; 1966.
[359] Stine R. An introduction to bootstrap methods: examples and ideas. In Modern Methods of Data Analysis. J. Fox & J.S. Long, eds. Newbury Park CA: Sage Publications; 1990: 353-373.
[360] Stone M. Cross-validation choice and assessment of statistical predictions. J. Roy. Statist. Soc. B. 1974; 36: 111-147.
[361] Suissa S; Shuster JJ. Are uniformly most powerful unbiased tests really best? Amer. Statistician. 1984; 38: 204-6.
[362] Syrjala SE. A statistical test for a difference between the spatial distributions of two populations. Ecology. 1996; 77: 75-80.
[363] Tarter ME; Lock MD. Model-Free Curve Estimation. New York: Chapman and Hall; 1993.
[364] Thompson JR; Bridges E; & Ensor K. Marketplace competition in the personal computer industry. Decision Sciences. 1992: 467-77.
[365] Thompson JR; Tapia RA. Nonparametric Function Estimation, Modeling and Simulation. Philadelphia PA: SIAM; 1990.
[366] Thompson JR; Scott D. Probability density estimation in higher dimensions. In J. Gentle, ed. Computer Science and Statistics. Amsterdam: North Holland; 1983: 173-9.
[367] Thompson JR; Bartoszynski R; Brown B; & McBride C. Some nonparametric techniques for estimating the intensity function of a cancer related nonstationary Poisson process. Ann. Statist. 1981; 9: 1050-60.
[368] Tibshirani RJ. Variance stabilization and the bootstrap. Biometrika. 1988; 75: 433-44.
[369] Titterington DM; Murray GD; Spiegelhalter DJ; Skene AM; Habbema JDF; & Gelpke GJ. Comparison of discrimination techniques applied to a complex data set of head-injured patients. J. Roy. Statist. Soc. A. 1981; 144: 145-175.
[370] Tracy DS; Khan KA. Comparison of some MRPP and standard rank tests for three equal sized samples. Commun. Statist. B. 1990; 19: 315-33.
[371] Tracy DS; Tajuddin IH. Empirical power comparisons of two MRPP rank tests. Commun. Statist. A. 1986; 15: 551-70.
[372] Tritchler D. On inverting permutation tests. JASA. 1984; 79: 200-207.
[373] Troendle JF. A stepwise resampling method of multiple hypothesis testing. JASA. 1995; 90: 370-8.
[374] Tsutakawa RK; Yang SL. Permutation tests applied to antibiotic drug resistance. JASA. 1974; 69: 87-92.
[375] Tukey JW. Improving crucial randomized experiments, especially in weather modification, by double randomization and rank combination. In LeCam L; & Binckly P, eds. Proceedings of the Berkeley Conference in Honor of J. Neyman and J. Kiefer. Hayward CA: Wadsworth; 1985; 1: 79-108.
[376] Valdes-Perez RE. Some recent human-computer studies in science and what accounts for them. AI Magazine. 1995; 16: 37-44.
[377] van der Voet H. Comparing the predictive accuracy of models using a simple randomization test. Chemometrics and Intelligent Laboratory Systems. 1994; 25: 313-23.
[378] vanKeerberghen P; Vandenbosch C; Smeyers-Verbeke J; & Massart DL. Some robust statistical procedures applied to the analysis of chemical data. Chemometrics and Intelligent Laboratory Systems. 1991; 12: 3-13.
[379] van-Putten B. On the construction of multivariate permutation tests in the multivariate two-sample case. Statist. Neerlandica. 1987; 41: 191-201.

[380] Vecchia DF; Iyer HK. Exact distribution-free tests for equality of several linear models. Commun. Statist. A. 1989; 18: 2467-88.
[381] Wald A; Wolfowitz J. An exact test for randomness in the nonparametric case based on serial correlation. Ann. Math. Statist. 1943; 14: 378-88.
[382] Wald A; Wolfowitz J. Statistical tests based on permutations of the observations. Ann. Math. Statist. 1944; 15: 358-72.
[383] Wahrendorf J; Brown CC. Bootstrapping a basic inequality in the analysis of the joint action of two drugs. Biometrics. 1980; 36: 653-7.
[384] Wei LJ. Exact two-sample permutation tests based on the randomized play-the-winner rule. Biometrika. 1988; 75: 603-05.
[385] Wei LJ; Smythe RT; & Smith RL. K-treatment comparisons in clinical trials. Annals Math. Statist. 1986; 14: 265-74.
[386] Welch WJ. Construction of permutation tests. JASA. 1990; 85(411): 693-8.
[387] Welch WJ. Rerandomizing the median in matched-pairs designs. Biometrika. 1987; 74: 609-14.
[388] Welch WJ; Gutierrez LG. Robust permutation tests for matched pairs designs. JASA. 1988; 83: 450-61.
[389] Werner M; Tolls R; Hultin J; & Metlecker J. Sex and age dependence of serum calcium, inorganic phosphorus, total protein, and albumin in a large ambulatory population. In Fifth Technical International Congress of Automation, Advances in Automated Analysis. Mount Kisco NY: Future Publishing; 1970.
[390] Wertz W. Statistical Density Estimation: A Survey. Göttingen: Vandenhoeck and Ruprecht; 1978.
[391] Westfall PH; Young SS. Resampling-Based Multiple Testing: Examples and Methods for p-value Adjustment. New York: John Wiley; 1993.
[392] Whaley FS. The equivalence of three individually derived permutation procedures for testing the homogeneity of multidimensional samples. Biometrics. 1983; 39: 741-5.
[393] Whittaker J. Graphical Models in Applied Statistics. Chichester: Wiley; 1990.
[394] Wilk MB. The randomization analysis of a generalized randomized block design. Biometrika. 1955; 42: 70-79.
[395] Wilk MB; Kempthorne O. Nonadditivities in a Latin square design. JASA. 1957; 52: 218-36.
[396] Wilk MB; Kempthorne O. Some aspects of the analysis of factorial experiments in a completely randomized design. Ann. Math. Statist. 1956; 27: 950-84.
[397] Williams-Blangero S. Clan-structured migration and phenotypic differentiation in the Jirels of Nepal. Human Biology. 1989; 61: 143-57.
[398] Witztum D; Rips E; & Rosenberg Y. Equidistant letter sequences in the Book of Genesis. Statist. Science. 1994; 9: 429-38.
[399] Wong RKW; Chidambaram N; & Mielke PW. Applications of multi-response permutation procedures and median regression for covariate analyses of possible weather modification effects on hail responses. Atmosphere-Ocean. 1983; 21: 1-13.
[400] Yucesan E. Randomization tests for initialization bias in simulation output. Naval Res. Logistics. 1993; 40: 643-63.
[401] Young GA. Bootstrap: More than a stab in the dark? Statist. Science. 1994; 9: 382-415.
[402] Young GA. A note on bootstrapping the correlation coefficient. Biometrika. 1988; 75: 370-3.
[403] Zempo N; Kayama N; Kenagy RD; Lea HJ; & Clowes AW. Regulation of vascular smooth-muscle-cell migration and proliferation in vitro and in injured rat arteries by a synthetic matrix metalloproteinase inhibitor. Art. Throm. V. 1996; 16: 28-33.

Index

Acceleration correction, 92, 94
Acceptance region, 80, 81, 100, 190
Accuracy, 16, 50, 78, 107
Ad-hoc, post-hoc, 181
Agronomy, 131, 142, 153
Algorithm, 117
Alternative
  distant, 107
  one- vs. two-sided, 109
  ordered, 54, 144
  power and, 101
  shift, 108, 136
  vs. hypothesis, 47
Archaeology, 161
Arrangements, 53, 81, 128
Assumptions, 38, 57, 59, 97, 105, 157
Autoradiograph, 47
Bias correction, 86, 92
Bin width, 169
Binomial
  comparing two Poissons, 70
  distribution, 67
  trials, 34-38, 178
Birth defects, 159
Block clustering, 173
Blocking, 130-132, 194
Bootstrap
  accelerated, 92
  bias correction, 92, 108
  confidence interval, 89
  distribution, 107
  estimate, 16
  iterated, 93
  resampling procedure, 201
  sample, 85-86, 150, 171
  smoothed, 90
  vs. permutation test, 201
Botany, 166
Box and whiskers. See Graphs
Business, 19, 21, 111, 174
CART, 174-176
Categorical data, 9, 113-129, 188, 202
Categorization, 165
Cause and effect, 26
Cell culture, 46, 64, 70
Censored data, 189
Censoring
  Type I, 190
  Type II, 191
Census, 21
Chance device, 34, 37
Chart. See Graphs
Chi-square statistic
  definition, 116
  drawbacks, 119
Chromosome, 54
Classification
  block clustering, 173
  bump hunting, 170

Classification (continued)
  defined, 165
  nearest neighbor, 177, 182
  neural net, 177
Clinical trial, 98, 117, 161, 195
Cochran's Q, 122, 126
Combinations, 36
Confidence bounds
  for binomial, 79
  for correlation, 89, 91
  for median, 79
  other, 151
  vs. acceptance region, 81
Confound, 135, 140, 147, 150
Conservative test, 102, 107, 109
Consistent estimate, 77
Contingency tables, 7
  2 x 2, 113
  1 x C goodness of fit, 177
  R x C ordered, 122
  R x C unordered, 120
  stratified, 117
  odds ratio test, 116
Continuous data, 203
Contours, 7-8
Control group, 42, 50, 61, 132
Correlation
  bivariate, 56, 88
  Pitman, 54, 57, 136, 189
Covariance, 84, 89, 156
Covariate, 130, 133, 152
Critical value, 62, 104
Cross validate
  bootstrap, 181
  defined, 165
  delete-d, 94, 169, 181
  jackknife, 181
  k-fold, 181
  leave-one-out, 181
Cumulative distribution, 7, 15, 37, 91
Cut-off point, 52
Data types, 10-12
Decision rule or procedure, 49, 98
Density estimation, 7, 53, 163-173, 200
Dependence, 38, 57
Deterministic, 30
Deviates, 52, 57
Discrete data, 203
Discrimination
  defined, 165, 175
Dispersion, 13-14, 61, 78
  L1 norm, 14
  range, 14
  standard deviation, 14, 16
  variance, 14
Distribution, 74
  binomial, 67, 74, 200
  bivariate, 191
  bivariate normal, 91
  chi-square, 71, 73, 119
  exponential, 71
  hypergeometric, 108, 120
  lognormal, 200
  mixture of, 75
  normal, 63, 71, 90, 108, 172, 200
  Poisson, 69, 74, 128, 200
  symmetric, 59, 105
Empirical distribution, 85, 90
Epidemiology, 125, 159
Error, 50, 72
  prediction, 85
  rate, 87
  residuals, 88
  Type I and II, 98-100
Estimate
  consistent, 77
  interval, 78
  plug-in, 85, 91
  point, 77
  unbiased, 77
Euclidean distance, 179
Events, 37
Exact test, 97, 102, 107, 159, 199
Exchangeable, 52, 105, 159, 199
Expected value, 67
Experiment, 40
Experimental designs
  balanced, 133-146
  k-sample, 135-137, 199
  Latin square, 141-144
  one-way table, 137-138
  randomized block, 130-133
  two-way table, 138-142
  unbalanced, 148-151
Extrapolation, 85
Extreme value, 50

F-statistic, 136-137, 146
Fair coin, 37, 79
False positive, 98-99
FDA, 105
Finite populations, 105, 107
Fisher's exact test, 113, 121-2, 126
Freeman-Halton p, 119
Frequency distribution, 50
GAMP test, 190
Genealogy, 180
Geology, 172
Goodness-of-fit, 88
Gradient, 141
Graphs
  bar chart, 4
  box and whiskers, 5, 90, 151
  contours, 7
  cumulative frequency, 7, 15, 37, 91
  histogram, 3
  pseudo, 8
  scatterplot, 6-11, 31, 170
Hazard function, 192-194
Heat resistance, 188
Histogram, 10, 166
Hotelling's T, 156-7
Hypothesis, 52
  null, 48, 105, 144, 177
  relation to test, 201
  simple vs. compound, 102
  vs. alternative, 48, 63, 104
Immunology, 61, 149
Indicator function, 161
Indifference region, 100, 190
Interaction, 135, 140, 148
Interpolation, 84
Interquartile range, 4, 9
Large sample approximation, 81, 109, 119
Latin square, 141-144
Least-squares, 84
Linear
  correlation. See Correlation
  form, 28
  model, 28-29
  regression. See Regression
Log-likelihood ratio, 119
Log rank scores, 195
Losses, 100, 104, 111, 136
LSAT, 23, 43, 66
Main effect, 135-6, 141
Mantel's U
  applications, 159
  equivalences, 125, 163
Marginals, 113, 125
Mastectomies, 198
Matched pairs, 61, 107, 133
Maximum, 2, 13, 17, 167
Mean, 33, 71, 145
  arithmetic, 13, 50
  geometric, 13
Median, 2, 9-13, 17-18, 91
Metric, 179
Minimum, 2, 14-17, 167
Missing data, 140, 149
Mode, 13, 16, 71, 170-2
Model, 29, 79
Model validation, 88, 178-180
Monotone function, 54, 74, 157
Monte Carlo, 119, 124
Multiplication rule, 38
Multivariate tests
  one-sample, 156
  two-sample, 157
Mutagenicity, 54
Mutually exclusive, 35

Jackknife, 181
Jurisprudence, 40, 93, 111
Jury selection, 40-42
K-sample comparisons, 123, 136-7
Kaplan-Meier estimate, 191
Kernel estimation, 167-168, 172
Labelings, 47, 134, 157
Neural net, 177
Nonlinear, 28
Norm, 14
Normal. See Distribution
Null hypothesis, 48, 105, 144, 177
Observation error, 33
Obstetrics, 164, 183
Odds ratio, 116, 125

Oncology, 118, 121, 188
One-sample problem, 57, 79, 106, 199
One-tailed or one-sided test, 80
One-way table, 137-8
Optimism, 86
Order statistics, 53
Ordered alternatives, 54, 88
Parameter, 99
  compared to statistic, 15
  location, 13-14
  regression, 30, 79, 192
Parametric test
  t-test, 72
  vs. nonparametric, 106-108, 199
Percentiles, 18, 33
Permutation
  defined, 35
  distribution, 49, 121, 178-180
  test procedure, 48
  vs. bootstrap, 201
Pie chart. See Graphs
Pitman correlation, 54, 57, 137, 156
Placebo, 61
Plug-in estimate, 77, 85, 91
Poisson. See Distribution
Population, 14-15, 39
Power
  as function of alternative, 101
  defined, 100
  effect of blocking, 131-2
  effect of distribution, 130
  effect of sample size, 100, 105, 146
  effect of variance, 131
  of permutation vs. parametric, 106
  relation to type II error, 100
Power curve, 101-102
Precision, 16, 51
Prediction error, 85-86
Probability, 37
Proportion, 35, 69
p-value. See Critical value
, 77
, 51, 112
Random
  assignments, 134
  fluctuations, 26, 30
  number, 41
  rearrangement, 134
  sample, 40
Randomization
  in experiments, 40, 133
  test. See Permutation: test procedure
Rearrange. See Permutation
Reassignments. See Permutation
Regression, 83, 102
  coefficients, 30-32
Rejection region, 100, 190
Reliability, 187
Repeated measures, 163
Replications, 144, 151
Rerandomize. See Permutation
Resampling
  methods, 209
  software, 244
Residuals, 30, 107, 135
Risk, 100, 104, 109
Sample
  distribution, 15
  representative, 3~3
  size, 76, 100, 104, 109, 146, 151-2, 155
  unrepresentative, 17
  vs. population, 14
  with/without replacement, 16
Savage scores, 188
Scatterplot. See Graphs
SEDA, 179-180
Significance level, 49, 53, 55, 57, 97, 100, 105
  bootstrap, 102
  distribution-free, 130
  of permutation test, 102, 159
Simulation, 125, 146
Simultaneous
  effects, 134
  inference, 161-3
Sociology, 160
Software, 244
Standard deviation, 14, 16, 33, 71
Standardize, 90, 161
Statistic, 15
Step function, 167
Stochastic, 30
Student's t, 73, 106, 156, 203, 208
Studentize. See Standardize

Survey, 113, 131
Survival
  analysis, 187-196
  rates, 107
  time, 145, 187, 192
t-test. See Student's t
Tail probabilities. See Critical value
Test
  Aly's, 53
  bivariate dependence, 57
  Chi-square, 116, 119
  Cochran's Q, 126
  F-test, 136-7, 146, 199, 203
  Fisher's exact, 107, 121-2, 126
  Freeman-Halton, 119
  GAMP, 190
  Goodman-Kruskal tau, 121
  Good's, 52
  Hotelling's T, 156-9
  ideal, 100
  independence, 122
  location, 58, 72
  Mantel's U, 125, 159, 163
  MRPP, 160
  one-sample, 57, 106
  one- vs. two-tailed, 115
  parametric vs. non-parametric, 106-109
  Pitman correlation, 137
  scale, 73
  shift, 51, 61
  sign, 59, 101
  statistic, 47-48, 179
  Student's t, 73, 106, 156, 199, 203
  tau, 121
  two-sample, 46, 107, 131, 157
Ties, 56, 188-9
Toxicology, 54, 125, 148
Training set, 166
Transformation, 51-52, 61, 107
  arcsin, 61
  logarithmic, 61
  rank, 190
Tree, 175
Triage, 164, 175
Trials, 68, 75
Truncated, 189
Two-sample problem, 46, 107, 131, 157, 199
Two-tailed or two-sided test
Type I and II errors, 98-100
Types of data, 10-12
Unbalanced designs, 148-151
Unbiased
  estimate, 77, 86
  test, 102-103, 108
Unimodal, 170
Validation
  defined, 165, 177. See also Cross validate
Variance, 68, 74
  testing equality, 51
Variation, 26-28, 99
Vector
  of observations, 85
  of parameters, 81
Virology, 22, 61, 65, 94
Weather, 31, 178
Weighted coin, 35
Wilcoxon-Gehan test, 194