
Resampling Procedures | Real Statistics Using Excel http://www.real-statistics.com/non-parametric-tests/resampling-procedures/

Real Statistics Using Excel Everything you need to do real statistical analysis using Excel

Resampling Procedures

Resampling procedures are based on the assumption that the underlying population distribution is the same as that of a given sample. The approach is to create a large number of samples from this pseudo-population using the techniques described in Sampling and then draw some conclusions from some statistic (mean, median, etc.) of each sample.

Resampling is generally simple to implement and doesn’t require complicated formulas. Unlike parametric techniques, few assumptions are made (e.g. the data doesn’t need to be normal and samples don’t necessarily need to be large). Resampling is useful when the population distribution is unknown or other techniques are not available.

We consider two types of resampling procedures: bootstrapping, where sampling is done with replacement, and permutation (also known as randomization tests), where sampling is done without replacement. Generally, bootstrapping is used for determining confidence intervals of some parameter, while randomization is used for hypothesis testing.
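The difference between the two kinds of draw can be sketched in a few lines of Python. This is only an illustration and not part of the Excel workflow: the function names and the five data values are hypothetical, and only the standard library `random` module is assumed.

```python
import random

def resample_with_replacement(data, rng):
    """Bootstrap-style draw: each element is picked independently, so repeats can occur."""
    return [rng.choice(data) for _ in data]

def resample_without_replacement(data, rng):
    """Permutation-style draw: every element appears exactly once, in random order."""
    return rng.sample(data, len(data))

rng = random.Random(0)        # fixed seed so the illustration is repeatable
data = [3, 7, 8, 12, 15]      # hypothetical values, not taken from the article
boot = resample_with_replacement(data, rng)
perm = resample_without_replacement(data, rng)
print("with replacement:   ", boot)
print("without replacement:", perm)
```

A bootstrap sample may repeat some values and omit others, whereas a permutation sample is always a reordering of the original data.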

One sample case

Suppose that we would like to calculate a confidence interval for the median. Since there are no standard statistical tests for such confidence intervals, we approach the problem via bootstrapping as described in the following example.

Example 1: Calculate a 95% confidence interval around the median for the memory loss program described in Example 1 of the […], but with the data given in columns A and B of Figure 1.

1 of 9 3/15/2020, 5:24 PM Resampling Procedures | Real Statistics Using Excel http://www.real-statistics.com/non-parametric-tests/resampling-procedures/

Figure 1 – Resampling – One sample case

The sample has a mean of 9 and a median of 9.5.

We treat the sample as the population and draw 2,000 samples of size 20 (the same size as the original sample) with replacement. Referring to Figure 1, D4:W4 represents the first sample, D5:W5 the second, etc. Each element in each sample is selected using the following function:

=INDEX(B4:B23,RANDBETWEEN(1,20))

We now take the median of each of the 2,000 samples (only the first 21 samples are shown in Figure 1). E.g. cell X4 contains the formula =MEDIAN(D4:W4). Next we plot the distribution of these medians (i.e. the range in column X starting at X4) in a histogram using Excel’s Histogram data analysis tool (or Excel’s charting capability), augmented with percentage and cumulative % columns. The results are shown in Figure 2.

Figure 2 – Analysis for Example 1

The value at the 2.5% percentile is 7 and the value at the 97.5% percentile is 10. Thus we can consider the 95% confidence interval as [7, 11], which contains the sample median of 9.5.
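For readers who want to try the same bootstrap procedure outside Excel, the following Python sketch draws the resamples, takes the median of each, and reads off the 2.5% and 97.5% percentiles. It is an illustration under stated assumptions: the function name `bootstrap_median_ci` and the 20 data values are hypothetical (the data of Figure 1 is not reproduced here), and the percentile positions are picked by simple index rounding rather than by Excel's histogram bins.

```python
import random
import statistics

def bootstrap_median_ci(sample, n_resamples=2000, alpha=0.05, seed=None):
    """Percentile bootstrap confidence interval for the median (illustrative sketch)."""
    rng = random.Random(seed)
    n = len(sample)
    # Each resample has the same size as the original sample and is drawn with replacement.
    medians = sorted(
        statistics.median(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Positions of the alpha/2 and 1 - alpha/2 percentiles in the sorted medians.
    lo_idx = round((alpha / 2) * n_resamples)
    hi_idx = round((1 - alpha / 2) * n_resamples) - 1
    return medians[lo_idx], medians[hi_idx]

# Hypothetical sample of size 20; these are NOT the values from Figure 1.
sample = [2, 4, 5, 6, 7, 7, 8, 9, 9, 9, 10, 10, 10, 11, 11, 12, 12, 13, 14, 15]
low, high = bootstrap_median_ci(sample, seed=1)
print(f"95% bootstrap CI for the median: [{low}, {high}]")
```

Because the resamples are random, different seeds give slightly different endpoints, just as recalculating the worksheet would.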

Observation: Instead of using the formula =INDEX(B4:B23,RANDBETWEEN(1,20)), we could use the formula =RANDOMIZE(B4:B23) based on the Real Statistics array function RANDOMIZE to select a sample of 20 elements with replacement.

Two independent samples

We now consider the case where we have two independent samples. When the data is normally distributed, we would use the t-test (for independent samples with equal or with unequal variances). We can also use the Wilcoxon Rank Sum or Mann-Whitney non-parametric test. We now show how to address such problems using the permutation version of resampling.

Example 2: Using resampling determine whether there is a significant difference between the median life expectancy of smokers and non-smokers using the data described in Figure 3 (this is Example 3 from Wilcoxon Rank Sum Test).

Figure 3 – Data for Example 2

Note that the median score of the non-smokers is 76.5 while the median score of smokers is 70.5, a difference of 6.

The null hypothesis is that there is no difference between the two groups, i.e.

H0 : the median scores for the populations of smokers and non-smokers are the same.

Based on the null hypothesis, we can assume that we have a single population of 78 (represented by the combined sample of smokers and non-smokers). To test the hypothesis we take 2,000 random samples of size 78 from this population without replacement and assume that for each sample the first 40 scores come from the non-smokers and the remaining 38 come from the smokers.

To draw these samples we use the approach described in Sampling, namely we use formulas of the form

=INDEX(J4:CI4,1,RANK(DC6,DC6:GB6))

where the range J4:CI4 contains all 78 data elements in the “population” and DC6:GB6 contains 78 random numbers, generated using RAND(). For each of the 2,000 samples we calculate the median of the non-smokers and smokers and record the difference. A histogram of these median differences is provided in Figure 4.
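The idea behind the INDEX/RANK formula is that the ranks of a set of distinct random numbers form a random permutation of 1, 2, …, n, so using each rank as an index picks every data element exactly once. A minimal Python sketch of that idea follows; the function name and the sample values are hypothetical, and rank 1 is given to the largest key on the assumption that RANK is used with its default (descending) order.

```python
import random

def shuffle_by_random_ranks(population, rng):
    """Reorder `population` by ranking one random key per position (illustrative sketch)."""
    keys = [rng.random() for _ in population]        # plays the role of the RAND() cells
    # Positions sorted from largest key to smallest, so rank 1 goes to the largest key.
    order = sorted(range(len(keys)), key=lambda i: keys[i], reverse=True)
    ranks = [0] * len(keys)
    for rank, position in enumerate(order, start=1):
        ranks[position] = rank
    # Position j receives the element whose index equals the rank of key j.
    return [population[r - 1] for r in ranks]

rng = random.Random(42)
population = [61, 65, 70, 72, 76, 80, 83]            # hypothetical values
shuffled = shuffle_by_random_ranks(population, rng)
print(shuffled)
```

Every element of the population appears exactly once in the result, which is what sampling without replacement requires.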

Figure 4 – Resampling for two independent samples

Now we need to check whether the median difference of the original sample is in the extreme 5% of the sampling distribution, taking the left and right tails together (two-tailed test). From Figure 4, we see that 1.60% of the samples have a median difference of -6 or less and 4.90% of the samples have a median difference of 6 or more, for a total of 6.50%. This means that the probability of getting a sample in either tail based on the null hypothesis is .065 > .05 = α, and so we cannot reject the null hypothesis and cannot conclude with 95% confidence that there is a significant difference between the life expectancy of smokers and non-smokers.

Observation: If we had used a one-tailed test, then p-value = .049 < .05 = α and so we would just barely reject the null hypothesis.

In the previous example we chose to test the median. Using the same technique, we could have chosen to test the mean instead.

Observation: Instead of using the formula =INDEX(J4:CI4,1,RANK(DC6,DC6:GB6)), we could use the formula =SHUFFLE(J4:CI4) based on the Real Statistics array function SHUFFLE to select a sample from the original data elements without replacement.
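The whole permutation test, including both the one-tailed and two-tailed p-values, can be sketched in Python as follows. This is an illustration only: the function name, the returned keys, and the two small groups are hypothetical (the 78 values of Figure 3 are not reproduced here), so the output will not match the percentages quoted above.

```python
import random
import statistics

def permutation_test_median_diff(group_a, group_b, n_resamples=2000, seed=None):
    """Permutation test for a difference in medians (illustrative sketch)."""
    rng = random.Random(seed)
    observed = statistics.median(group_a) - statistics.median(group_b)
    threshold = abs(observed)
    pooled = list(group_a) + list(group_b)   # single population under the null hypothesis
    n_a = len(group_a)
    upper = lower = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)                  # sampling without replacement
        diff = statistics.median(pooled[:n_a]) - statistics.median(pooled[n_a:])
        if diff >= threshold:
            upper += 1
        if diff <= -threshold:
            lower += 1
    p_upper = upper / n_resamples
    p_lower = lower / n_resamples
    # If the observed difference is 0, a sample can fall in both tails; cap at 1.
    p_two = min(1.0, p_upper + p_lower)
    return {"observed": observed, "p_upper": p_upper, "p_lower": p_lower, "p_two_tailed": p_two}

# Hypothetical groups, chosen only to make the sketch runnable.
group_a = [72, 75, 76, 77, 79, 81, 84, 88]
group_b = [64, 66, 69, 70, 71, 74, 78, 80]
result = permutation_test_median_diff(group_a, group_b, seed=7)
print(result)
```

Comparing `p_two_tailed` with α corresponds to the two-tailed test, while `p_upper` (or `p_lower`) alone corresponds to a one-tailed test in that direction.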

Two matched samples

We now consider the case where we have two matched samples. When the data is normally distributed (or symmetric), we would use the Paired Sample t-test. Even for non-normal data we can use the Wilcoxon Signed-Ranks non-parametric test. We now show how to address such problems using resampling techniques.

Example 3: Using resampling determine whether there is a significant difference between a person’s ability to identify objects with their right eye and with their left eye, using the paired data whose differences are shown in Figure 5 (this is an example from the Wilcoxon Signed-Ranks Test for Paired Samples).

The null hypothesis is that there is no difference between a person’s ability to identify objects with their right eye and their ability with their left eye, i.e. the median difference is zero. As we have seen previously, the data does not meet the assumptions of the t-test, and so it might be better not to use it. We will use resampling and assume that the population is the same as the sample.

If the null hypothesis is true then each of the 15 scores for the right eye is just as likely to be larger as smaller than the score for the left eye, and so we can randomly exchange the scores of each person’s eyes. This is equivalent to randomly changing the sign of the difference between the scores. Thus, we take 2,000 samples each of size 15 (the size of the sample) using the sample data but randomly assigning the sign of the difference as positive or negative (with a 50% probability of each outcome). This is a form of sampling without replacement. The absolute values of the elements in each sample are the same as in the population; only the signs vary.

Figure 5 – Resampling for paired samples

Figure 5 shows the first 16 samples (out of 2,000). The range F3:T3 contains the differences of the original sample. Each of the 15 data elements in the first sample is generated using the formulas

IF(RANDBETWEEN(0,1)=0,F$3,-F$3) through IF(RANDBETWEEN(0,1)=0,T$3,-T$3)

and similarly for the other 1,999 samples. For each sample we calculate the median and create a histogram of the 2,000 median values as shown in Figure 6.

Figure 6 – Analysis for Example 3

The median of the original sample (i.e. the resampling “population”) is MEDIAN(D4:D18) = 3. From Figure 6 we see that 10.00% of all the samples have a median ≤ -3 and 12.30% have a median ≥ 3. Since 10.00 + 12.30 = 22.30% > 5% = α, we cannot reject the null hypothesis, and so conclude there is no significant difference between the right and left eye of the population.
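The sign-flipping procedure for matched samples can be sketched in Python as shown below. This is an illustration under stated assumptions: the function name, the returned keys, and the 15 differences are hypothetical (the data in D4:D18 is not reproduced here), and `rng.randint(0, 1)` plays the role of RANDBETWEEN(0,1).

```python
import random
import statistics

def sign_flip_test_median(differences, n_resamples=2000, seed=None):
    """Randomization test for paired data via random sign changes (illustrative sketch)."""
    rng = random.Random(seed)
    observed = statistics.median(differences)
    threshold = abs(observed)
    upper = lower = 0
    for _ in range(n_resamples):
        # Keep each absolute difference, but assign its sign with probability 1/2 each.
        flipped = [d if rng.randint(0, 1) == 0 else -d for d in differences]
        m = statistics.median(flipped)
        if m >= threshold:
            upper += 1
        if m <= -threshold:
            lower += 1
    # If the observed median is 0, a sample can fall in both tails; cap at 1.
    p_two = min(1.0, (upper + lower) / n_resamples)
    return {"observed": observed, "p_upper": upper / n_resamples,
            "p_lower": lower / n_resamples, "p_two_tailed": p_two}

# Hypothetical right-minus-left differences for 15 subjects.
differences = [3, -2, 5, 1, -4, 6, 3, -1, 2, 7, -3, 4, 0, 5, -6]
result = sign_flip_test_median(differences, seed=11)
print(result)
```

As in the worksheet, the null hypothesis is rejected only if `p_two_tailed` falls below α.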

Charles says: December 11, 2017 at 3:01 pm

Huyen, Yes, you are correct. I believe that the 2.35% value was left over from a previous version of the data. In any ca
