
Introduction to Resampling Methods Using R

Contents

1 Sampling from known distributions and simulation
  1.1 Sampling from normal distributions
  1.2 Specifying seeds
  1.3 Sampling from exponential distributions
2 Bootstrap confidence intervals
  2.1 Bootstrap distributions
  2.2 Bootstrap confidence intervals
    2.2.1 Percentile method
    2.2.2 Pivot method
    2.2.3 Standard bootstrap
3 Randomization/permutation tests
  3.1 Creating random permutations
  3.2 Comparing groups
    3.2.1 Exact randomization distribution
    3.2.2 Random sampling the randomization distribution
    3.2.3 Choice of test statistic
  3.3 Wilcoxon rank-sum test
  3.4 Selecting among two-sample tests
  3.5 More than two groups
  3.6 Contingency tables
4 Methods for correlation and regression
  4.1 Randomization test for linear relation
    4.1.1 Pearson correlation or slope of regression line
    4.1.2 Rank correlation
  4.2 Bootstrap intervals for correlation and slope
    4.2.1 Bivariate bootstrap sampling
    4.2.2 Confidence intervals
    4.2.3 Fixed-X sampling for the slope
5 Two-sample bootstrap intervals


1. Sampling from known distributions and simulation

In introductory statistics courses we are told that the t-test is “robust” to departures from normality, especially if the sample size is large. What this means is that if we specify a particular Type I error rate, then the actual proportion of false rejections will be close to that Type I error rate. Let’s create and run a simulation to explore this.

Steps.

1. Generate a random sample from some population distribution
2. Calculate the sample mean, standard deviation, and t statistic
3. Decide if the null hypothesis is rejected
4. Repeat 1-3, counting the number of rejections

1.1. Sampling from normal distributions

counter <- 0              # set counter to 0
t.crit <- qt(0.95,14)     # 5% critical value
for (i in 1:1000) {
  x <- rnorm(15, 25, 4)   # draw a random sample of size 15 from a N(25,4) distribution
  t <- (mean(x)-25)*sqrt(15)/sd(x)
  if (t >= t.crit)        # check to see if result is significant
    counter <- counter + 1  # increase counter by 1
}
counter/1000              # compute estimate of Type I error rate

## [1] 0.06

If we execute this code again, a different set of random samples will be selected, and a different estimate obtained.

counter <- 0              # set counter to 0
t.crit <- qt(0.95,14)     # 5% critical value
for (i in 1:1000) {
  x <- rnorm(15, 25, 4)   # draw a random sample of size 15 from a N(25,4) distribution
  t <- (mean(x)-25)*sqrt(15)/sd(x)
  if (t >= t.crit)        # check to see if result is significant
    counter <- counter + 1  # increase counter by 1
}
counter/1000              # compute estimate of Type I error rate

## [1] 0.043

1.2 Specifying seeds

Specify a seed to get identical results each time.

set.seed(4123)
counter <- 0              # set counter to 0
t.crit <- qt(0.95,14)     # 5% critical value
for (i in 1:1000) {
  x <- rnorm(15, 25, 4)   # draw a random sample of size 15 from a N(25,4) distribution
  t <- (mean(x)-25)*sqrt(15)/sd(x)
  if (t >= t.crit)        # check to see if result is significant
    counter <- counter + 1  # increase counter by 1
}
counter/1000              # compute estimate of Type I error rate

## [1] 0.043

Execute the same code again:

set.seed(4123)
counter <- 0              # set counter to 0
t.crit <- qt(0.95,14)     # 5% critical value
for (i in 1:1000) {
  x <- rnorm(15, 25, 4)   # draw a random sample of size 15 from a N(25,4) distribution
  t <- (mean(x)-25)*sqrt(15)/sd(x)
  if (t >= t.crit)        # check to see if result is significant
    counter <- counter + 1  # increase counter by 1
}
counter/1000              # compute estimate of Type I error rate

## [1] 0.043


Instead of using a counter, we may want to store the results so they can be explored later. In the code below, a vector is created and used to store the calculated t-statistics.

set.seed(4123)
nsims <- 1000
t.crit <- qt(0.95,14)          # 5% critical value
results <- numeric(nsims)      # vector to store t statistics
for (i in 1:nsims) {
  x <- rnorm(15, mean=0, sd=1) # draw a random sample of size 15 from a N(0,1) distribution
  results[i] <- (mean(x)-0)*sqrt(15)/sd(x)
}
sum(results >= t.crit)/nsims   # compute estimate of error rate

## [1] 0.043

Having the results saved in a vector allows us to explore the actual sampling distribution. Below we graphically assess agreement between theoretical and actual distributions.

hist(results, freq = F, ylim=c(0,0.4))  # plot histogram of t statistics
curve(dt(x,14), add = TRUE)             # superimpose t(14) density


1.3 Sampling from exponential distributions

set.seed(4123)
nsims <- 1000
t.crit <- qt(0.95,14)        # 5% critical value
results <- numeric(nsims)    # vector to store t statistics
for (i in 1:nsims) {
  x <- rexp(15, rate=1/25)   # draw a random sample of size 15 from an Exp(mean=25) distribution
  results[i] <- (mean(x)-25)*sqrt(15)/sd(x)
}
sum(results >= t.crit)/nsims # compute estimate of error rate

## [1] 0.015

Graphically assess agreement between theoretical and actual distributions.

hist(results, freq = F, xlim=c(-6,4), ylim=c(0,0.4))  # plot histogram of t statistics
curve(dt(x,14), add = TRUE)                           # superimpose t(14) density


Available distributions (http://www.stat.umn.edu/geyer/old/5101/rlook.html)

Distribution                   Functions
Beta                           pbeta qbeta dbeta rbeta
Binomial                       pbinom qbinom dbinom rbinom
Cauchy                         pcauchy qcauchy dcauchy rcauchy
Chi-Square                     pchisq qchisq dchisq rchisq
Exponential                    pexp qexp dexp rexp
F                              pf qf df rf
Gamma                          pgamma qgamma dgamma rgamma
Geometric                      pgeom qgeom dgeom rgeom
Hypergeometric                 phyper qhyper dhyper rhyper
Logistic                       plogis qlogis dlogis rlogis
Log Normal                     plnorm qlnorm dlnorm rlnorm
Negative Binomial              pnbinom qnbinom dnbinom rnbinom
Normal                         pnorm qnorm dnorm rnorm
Poisson                        ppois qpois dpois rpois
Student t                      pt qt dt rt
Studentized Range              ptukey qtukey dtukey rtukey
Uniform                        punif qunif dunif runif
Weibull                        pweibull qweibull dweibull rweibull
Wilcoxon Rank Sum Statistic    pwilcox qwilcox dwilcox rwilcox
Wilcoxon Signed Rank Statistic psignrank qsignrank dsignrank rsignrank
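Each row of the table follows the same naming pattern: for a distribution `foo`, `dfoo` is the density, `pfoo` the cumulative distribution function, `qfoo` the quantile function, and `rfoo` the random-number generator. A quick illustration with the normal distribution:

```r
dnorm(0)        # density of N(0,1) at 0, equal to 1/sqrt(2*pi)
pnorm(1.96)     # P(Z <= 1.96), about 0.975
qnorm(0.975)    # 97.5th percentile, about 1.96
rnorm(3)        # three random draws from N(0,1)
```

Note that `pfoo` and `qfoo` are inverses of one another, which is why `qt(0.95,14)` above returns the cutoff whose upper tail area is 5%.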


2. Bootstrap Confidence intervals

Suppose we want to estimate a population parameter, based on a random sample.

- Classical World: Observe one sample and the value of the sample statistic. Sampling distribution is determined by considering all possible (unobserved) samples from the same assumed population. Cannot directly observe the sampling distribution.

- Bootstrap World: Rather than assume a population, consider the observed sample to be the best estimate of the population. In fact, we will assume that it represents the probability distribution for the population. We can then generate all (or at least very many) possible samples by taking bootstrap samples (with replacement) from this estimated population, and thus observe the sampling distribution of the sample statistic.

2.1 Drawing bootstrap samples using R.

We start with a very small data set, a set of new employee test scores:

23, 31, 37, 46, 49, 55, 57

First select a sample of size 7, with replacement, and compute the mean of the bootstrap sample.

score <- c(37,49,55,57,23,31,46)
mean <- mean(score)
mean

## [1] 42.57143

boot <- sample(score, size=7, replace=TRUE)
boot

## [1] 31 37 31 31 31 37 31

mean.boot <- mean(boot)
mean.boot

## [1] 32.71429


We need to do this many times to estimate the sampling distribution of the mean.

score <- c(37,49,55,57,23,31,46)
mean <- mean(score)
mean

## [1] 42.57143

N <- length(score)
nboots <- 10000
boot.result <- numeric(nboots)
for(i in 1:nboots) {
  boot.samp <- sample(score, N, replace=TRUE)
  boot.result[i] <- mean(boot.samp)
}
hist(boot.result)


2.2 Bootstrap confidence intervals

Example. Suppose we have a random sample of size 30 from an exponential distribution with mean 25. We want to use the sample mean to estimate the population mean.

We will discuss three ways to construct confidence intervals using bootstrapping.

2.2.1 Percentile method

If the estimator of the population parameter is the statistic used to create the bootstrap distribution (e.g., the sample mean), then the confidence interval is simply the interval between the equal-tail quantiles of the bootstrap distribution that correspond to the confidence level.

Bootstrap 95% percentile confidence interval

set.seed(4123)
x.exp <- rexp(30, rate=1/25)
x.exp

## [1] 2.528853 2.235845 40.423011 5.255557 3.355874 2.724010
## [7] 10.787030 9.792154 31.882324 15.816492 13.925713 38.726646
## [13] 19.283214 70.833128 30.556819 31.620638 64.401698 34.713028
## [19] 6.153141 81.498008 16.034828 80.867315 41.204157 77.066567
## [25] 14.237094 47.647705 35.505526 13.203980 104.795832 11.897545

boxplot(x.exp)


n <- length(x.exp)
mean.exp <- mean(x.exp)
nboots <- 10000
boot.result <- numeric(nboots)
for(i in 1:nboots) {
  boot.samp <- sample(x.exp, n, replace=TRUE)
  boot.result[i] <- mean(boot.samp)
}
hist(boot.result)


mean.exp

## [1] 31.96579

quantile(boot.result, c(0.025,0.975))

## 2.5% 97.5%
## 22.49355 42.19327

2.2.2 Pivot method

A pivot quantity is a function of the estimator whose distribution does not depend on the parameter being estimated.

Example: Estimating the population mean, μ, based on the sample mean, Ȳ. Then the statistic

  T = (Ȳ − μ) / (S/√n) ~ t(n − 1)

has a Student’s t distribution with n − 1 degrees of freedom. Because the distribution of T does not depend on μ, T is a pivot quantity.


When such a quantity exists, we can then use bootstrapping to estimate the distribution of the pivot quantity—essentially a custom table—and use quantiles from the table to create the confidence interval.

For the statistic T = (Ȳ − μ)/(S/√n), the confidence interval is given by

  ( Ȳ − t_{b,0.975} S/√n , Ȳ − t_{b,0.025} S/√n ),

where Ȳ and S are the values from the original sample.

Bootstrap 95% T‐pivot confidence intervals

set.seed(4123)
x.exp <- rexp(30, rate=1/25)

n <- length(x.exp)
mean.exp <- mean(x.exp)
sd.exp <- sd(x.exp)

nboots <- 10000
boot.t.result <- numeric(nboots)
for(i in 1:nboots) {
  boot.samp <- sample(x.exp, n, replace=TRUE)
  boot.t.result[i] <- (mean(boot.samp)-mean.exp)*sqrt(n)/sd(boot.samp)
}

hist(boot.t.result)


t.upper <- quantile(boot.t.result, 0.975)
t.lower <- quantile(boot.t.result, 0.025)
lower95.limit <- mean.exp - t.upper*sd.exp/sqrt(n)
upper95.limit <- mean.exp - t.lower*sd.exp/sqrt(n)
lower95.limit

## 97.5%
## 22.80743

upper95.limit

## 2.5% ## 44.39896

Quantiles of t(29) distribution compared to bootstrap distribution:

quantile(boot.t.result, c(0.005,0.025,0.05,0.95,0.975,0.995))

## 0.5% 2.5% 5% 95% 97.5% 99.5%
## -3.515286 -2.416519 -1.977130 1.496785 1.780026 2.367328

qt(c(0.005,0.025,0.05,0.95,0.975,0.995),n-1)

## -2.756386 -2.045230 -1.699127 1.699127 2.045230 2.756386

“Theory” behind T-pivot interval

Suppose we wish to estimate the mean of a population. The pivotal method can be used, assuming we can find a statistic whose distribution does not depend on the parameter to be estimated.

Ex: t = (X̄ − μ)/(s/√n). If X ~ N(μ, σ²), then t ~ t(n − 1). The distribution of t does not depend on either μ or σ², and thus t is a pivot quantity. Then since

  P( t_{α/2} ≤ (X̄ − μ)/(s/√n) ≤ t_{1−α/2} ) = 1 − α,

we have

  P( t_{α/2} s/√n ≤ X̄ − μ ≤ t_{1−α/2} s/√n ) = 1 − α
  P( X̄ − t_{1−α/2} s/√n ≤ μ ≤ X̄ − t_{α/2} s/√n ) = 1 − α,

so

  ( X̄ − t_{1−α/2} s/√n , X̄ − t_{α/2} s/√n )

is a 100(1 − α)% confidence interval for μ.

Now, since t ~ t(n − 1), where t(n − 1) is symmetric, we have t_{α/2} = −t_{1−α/2}. We can write

  ( X̄ − t_{1−α/2} s/√n , X̄ + t_{1−α/2} s/√n ),

where if α = .05, 1 − α/2 = .975.

If the sample does not come from a normal population, t is still a pivotal quantity, so we can still write

  P( t_{α/2} ≤ (X̄ − μ)/(s/√n) ≤ t_{1−α/2} ) = 1 − α

and

  ( X̄ − t_{1−α/2} s/√n , X̄ − t_{α/2} s/√n )    (1)

But we cannot say that t ~ t(n − 1)!

One Solution: Estimate the distribution of t using bootstrap sampling.

XX *For each bootstrap sample of size n from the data, compute t  b , then find t and b b, 2 sb  n t and substitute into (1) above. b,1 2


2.2.3 Using bootstrap samples to estimate the standard error.

The “original”, or often called “standard”, bootstrap method. This method is motivated by the “Wald” interval, which assumes that many statistics are approximately normally distributed for large sample sizes, and creates the interval as

  Estimator ± Z_{α/2} × SE.

If a formula for the SE is not available, then SE can be estimated using bootstrapping. For example, to estimate the standard error of the mean:

1. Compute the mean squared error,

     MSE = (1/B) Σ_{i=1}^{B} ( X̄_{b,i} − X̄ )²,

   where X̄ is the mean of the original sample and X̄_{b,i} is the mean of the ith of the B bootstrap samples.

2. Compute SE = √MSE.

3. The confidence interval is X̄ ± Z_{α/2} × SE.

set.seed(4123)
x.exp <- rexp(30, rate=1/25)

n <- length(x.exp)
mean.exp <- mean(x.exp)

nboots <- 10000
boot.MSE.result <- numeric(nboots)
for(i in 1:nboots) {
  boot.samp <- sample(x.exp, n, replace=TRUE)
  boot.MSE.result[i] <- (mean(boot.samp)-mean.exp)^2
}

SE <- sqrt(mean(boot.MSE.result))

lower95.limit <- mean.exp - qnorm(0.975)*SE
upper95.limit <- mean.exp + qnorm(0.975)*SE
lower95.limit

## [1] 22.07738

upper95.limit


## [1] 41.85421


3. Randomization/Permutation tests--Comparing two or more groups

A company is trying to decide whether to augment its traditional instruction for new employees with computer assisted instruction. Seven new employees are selected. Four are assigned at random to the new method and the remaining three to the traditional method. A test is given at the end of instruction for all employees, and the scores are given below.

New method Smith (37), Lin (49), O’Neal (55), Zedillo (57) Traditional method Johnson (23), Green (31), Zook (46)

Suppose we would like to test for evidence that the new method tends to produce higher scores.

We might test H₀: μ_N = μ_T vs. H_a: μ_N > μ_T, using a t-test.

 t  2.08, p-value = 0.046. Statistically significant?  Independent random samples from normally distributed populations?

Permutation/Randomization test.

- Sampling distribution based upon all possible assignments of the experimental units to treatments.
- Important assumption: Each assignment is equally likely under the null hypothesis—guaranteed by randomization.

3.1 Creating a random permutation using R.

set.seed(4123)
score <- c(37,49,55,57,23,31,46)
perm <- sample(score, replace=F)
perm

## [1] 46 55 37 31 23 57 49

Now, compute the mean of the first 4 entries and the mean of the last 3, then compute the difference.

mean.new <- mean(perm[1:4])
mean.new

## [1] 42.25

mean.trad <- mean(perm[5:7])
mean.trad


## [1] 43

diff <- mean.new - mean.trad
diff

## [1] ‐0.75

Another way is to create an index vector, and sample for just one group.

set.seed(4123)
N <- length(score)
index <- sample(N, size=4, replace=F)
index

## [1] 7 3 1 6

score[index]

## [1] 46 55 37 31

score[‐index]

## [1] 49 57 23

diff2 <- mean(score[index]) - mean(score[-index])
diff2

## [1] ‐0.75

3.2 Comparing groups

Steps of the permutation test using mean difference as the test statistic:

1. Compute the test statistic D_obs = X̄_N − X̄_T on the observed data

2. Randomly assign units to treatments, and recompute D

3. Repeat #2 for all possible random assignments of units to treatments

4. The p-value is P(D ≥ D_obs) = (# D values at least as large as D_obs)/(# randomizations).

3.2.1. Exact randomization distribution. The table below lists all 35 possible randomizations and their corresponding mean differences. Note that the permutation we created is that listed in row 19.


Randomization  New1 New2 New3 New4  Trad5 Trad6 Trad7  X̄New − X̄Trad  Sum New
  1            46   49   55   57    23    31    37      21.4167       207
 *2            37   49   55   57    46    23    31      16.1667       198
  3            37   46   55   57    31    49    23      14.4167       195
  4            31   49   55   57    46    23    37      12.6667       192
  5            31   46   55   57    37    49    23      10.9167       189
  6            37   46   49   57    31    23    55      10.9167       189
  7            37   46   49   55    31    23    57       9.7500       187
  8            23   49   55   57    46    31    37       8.0000       184
  9            31   46   49   57    23    37    55       7.4167       183
 10            23   46   55   57    37    49    31       6.2500       181
 11            31   46   49   55    23    37    57       6.2500       181
 12            31   37   55   57    46    49    23       5.6667       180
 13            23   46   49   57    31    37    55       2.7500       175
 14            31   37   49   57    46    23    55       2.1667       174
 15            23   46   49   55    31    37    57       1.5833       173
 16            23   37   55   57    46    49    31       1.0000       172
 17            31   37   49   55    46    23    57       1.0000       172
 18            31   37   46   57    23    49    55       0.4167       171
 19            31   37   46   55    23    49    57      -0.7500       169
 20            23   31   55   57    46    49    37      -2.5000       166
 21            23   37   49   57    46    31    55      -2.5000       166
 22            23   37   49   55    46    31    57      -3.6667       164
 23            23   37   46   57    31    49    55      -4.2500       163
 24            31   37   46   49    23    55    57      -4.2500       163
 25            23   37   46   55    31    49    57      -5.4167       161
 26            23   31   49   57    46    37    55      -6.0000       160
 27            23   31   49   55    46    37    57      -7.1667       158
 28            23   31   46   57    37    49    55      -7.7500       157
 29            23   31   46   55    37    49    57      -8.9167       155
 30            23   37   46   49    31    55    57      -8.9167       155
 31            23   31   46   49    37    55    57     -12.4167       149
 32            23   31   37   57    46    49    55     -13.0000       148
 33            23   31   37   55    46    49    57     -14.1667       146
 34            23   31   37   49    46    55    57     -17.6667       140
 35            23   31   37   46    49    55    57     -19.4167       137


Two of the 35 possible assignments of units to treatments gave mean differences as large as or larger than the observed value D_obs = 16.2 (one of these is D_obs itself). Thus the p-value is

  P(D ≥ D_obs) = 2/35 ≈ 0.057.

- Exact p-value—does not depend upon unverifiable assumptions
- The p-value (0.046) from the t-test can be viewed as an approximation to the exact permutation p-value.

3.2.2 Using R to generate many randomizations and compute p-value.

Now, we use a for loop as before to create many random permutations and corresponding mean differences, store all the mean differences and then compute the p-value.

score <- c(37,49,55,57,23,31,46)
observed.diff <- mean(score[1:4]) - mean(score[5:7])
N <- length(score)
set.seed(4132)
nperms <- 9999
perm.result <- numeric(nperms)  # vector to save the random differences
for(i in 1:nperms) {
  index <- sample(N, size=4, replace=FALSE)  # choose 4 values w/o replacement
  perm.result[i] <- mean(score[index]) - mean(score[-index])
}

hist(perm.result)


(sum(perm.result >= observed.diff)+1)/(nperms + 1) #P‐value

## [1] 0.0562

Note that this p-value is not exact.

When sample sizes are moderate to large, enumerating all possible arrangements may be very time consuming at best and practically impossible at worst. As the table below illustrates, with two samples of 25, there are over 100 trillion arrangements to consider!

Total sample size  Sample 1  Sample 2  Permutations
10                 5         5         252
20                 10        10        184,756
30                 15        15        155,117,520
40                 20        20        137,846,528,820
50                 25        25        126,410,606,437,752

Solution:

Randomly sample the population of permutations. While enumerating 1 trillion permutations may be computationally time-prohibitive, in many cases enumerating 10,000 or even 100,000 is not. When estimating an exact p-value based on a random sample of R permutations, the estimate would be expected to fall within 2√(p(1 − p)/R) of the exact p-value, with 95% confidence. For example, if the true p-value is p = 0.05, the estimate would be expected to be within the margins of error below:

R        95% margin of error
1000     0.013784
5000     0.006164
10000    0.004359
100000   0.001378
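Both tables can be reproduced directly: choose() counts the arrangements, and the margin-of-error formula is a one-liner (the function name moe is just illustrative):

```r
# Number of ways to split 2k observations into two groups of size k
sapply(c(5, 10, 15, 20, 25), function(k) choose(2 * k, k))

# 95% margin of error for a p-value estimated from R sampled permutations
moe <- function(p, R) 2 * sqrt(p * (1 - p) / R)
moe(0.05, c(1000, 5000, 10000, 100000))
```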


Example. This dataset contains results from an experiment in visual perception using random dot stereograms, such as that shown below. Both images appear to be composed entirely of random dots. However, they are constructed so that a 3D image (of a diamond) will be seen if the images are viewed with a stereo viewer, causing the separate images to fuse. Another way to fuse the images is to fixate on a point between them and defocus the eyes, but this technique takes some effort and practice. An experiment was performed to determine whether knowledge of the form of the embedded image affected the time required for subjects to fuse the images. One group of subjects (group NV, 43 subjects) received either no information or just verbal information about the shape of the embedded object. A second group (group VV, 35 subjects) received both verbal information and visual information (e.g., a drawing of the object).

[Cleveland, W. S. (1993). Visualizing Data. Original source: Frisby, J. P. and Clatworthy, J.L., "Learning to see complex random-dot stereograms," Perception, 4, (1975), pp. 173-178]

A randomization test on these data involves over 1.8 × 10²² permutations (all 78-choose-43 assignments of the subjects to the two groups). In the following code, the data are read and a boxplot created that shows several outliers in each group. Thus we consider using the difference in medians as the test statistic instead of the mean difference. The p-value is estimated based on 10,000 randomly sampled randomizations.

fusion <- read.table("C:/Users/sjricht2/Documents/DataSets/Independent samples T-test/Fusion_data.txt", header=TRUE)
boxplot(fusion$time ~ fusion$treatment)


names(fusion)

## [1] "time" "treatment"

N <- length(fusion$time)
Treat1 <- subset(fusion, treatment=="NV")
Treat2 <- subset(fusion, treatment=="VV")
N1 <- length(Treat1$time)
head(Treat1,5)

## time treatment
## 1 47.20001 NV
## 2 21.99998 NV
## 3 20.39999 NV
## 4 19.70001 NV
## 5 17.40000 NV

head(Treat2,5)

## time treatment
## 44 19.70001 VV
## 45 16.19998 VV
## 46 15.90000 VV
## 47 15.40002 VV
## 48 9.70000 VV


observed.fusion <- median(Treat1$time) - median(Treat2$time)
observed.fusion

## [1] 3.3

nperms <- 9999
set.seed(4132)
result <- numeric(nperms)
for(i in 1:nperms) {
  index <- sample(N, size=N1, replace = FALSE)
  result[i] <- median(fusion$time[index]) - median(fusion$time[-index])
}

(sum(result >= observed.fusion)+1)/(nperms + 1)

## [1] 0.3213


What hypotheses are being tested by the permutation test?

No population distribution is assumed, and thus it does not make sense to test parameters (e.g., equality of means).

Null hypothesis: The population distributions are identical.
Alternative hypothesis: Observations tend to be larger in one of the populations.

3.2.3 Choice of test statistic

Another advantage of the permutation test is that the function of the sample that is best suited to address the research question may be used. For a test of location difference, we may use

- Difference of means
- Difference of medians
- Others (e.g., trimmed means, ratios)


3.3 Wilcoxon Rank sum test: Permutation test on rank transformed data

- Requires only ordinal level data
- Reduces effect of outliers for interval level data
- Hypotheses are the same as permutation test on raw data

Rank transformation: replace data with their respective ranks.
- For the WRS test, data from both samples are combined, then assigned ranks
- Tied observations generally all receive the “average rank”

Steps for WRS test.

1. Combine observations and assign ranks, with tied observations receiving the average rank
2. Perform permutation test on ranks (mean difference or sum of ranks in sample 1 can be used as the test statistic)

fusion <- read.table("C:/Users/sjricht2/Documents/DataSets/Independent samples T-test/Fusion_data.txt", header=TRUE)
fusion$ranks.time <- rank(fusion$time)
head(fusion,5)

## time treatment ranks.time
## 1 47.20001 NV 78.0
## 2 21.99998 NV 77.0
## 3 20.39999 NV 76.0
## 4 19.70001 NV 74.5
## 5 17.40000 NV 73.0

N <- length(fusion$time)
Treat1 <- subset(fusion, treatment=="NV")
Treat2 <- subset(fusion, treatment=="VV")
N1 <- length(Treat1$ranks.time)
observed.fusion <- mean(Treat1$ranks.time) - mean(Treat2$ranks.time)
observed.fusion

## [1] 11.42791

nperms <- 9999
set.seed(4132)
result <- numeric(nperms)
for(i in 1:nperms) {
  index <- sample(N, size=N1, replace = FALSE)
  result[i] <- mean(fusion$ranks.time[index]) - mean(fusion$ranks.time[-index])
}

(sum(result >= observed.fusion)+1)/(nperms + 1)

## [1] 0.2064

3.4 Selecting Among Two-sample tests

1) t-test—If we have independent random samples from normal populations, the t-test is optimal for detecting a location difference

2) Permutation test using means—Gives exact p-value regardless of distribution of population. Power will be similar to t-test

3) Permutation test using medians—Can have higher power than tests on means, especially for skewed and heavy-tailed distributions

4) Permutation test using ranks (Wilcoxon rank-sum test)-- Can have higher power than tests on means, especially for skewed and heavy-tailed distributions

The WRS test has been studied extensively in relation to the t-test. The t-test tends to have higher power for symmetric distributions, especially for lighter tailed distributions and smaller sample sizes. The WRS test generally has higher power for heavier-tailed distributions and moderate to large sample sizes.
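This comparison can be explored with a small simulation. The sketch below uses a heavy-tailed t(3) population and an arbitrary location shift; the sample sizes, shift, simulation count, and seed are all illustrative choices, not part of the original notes:

```r
set.seed(4123)
nsims <- 500
n <- 20
shift <- 1
rej.t <- rej.w <- 0
for (i in 1:nsims) {
  x <- rt(n, df = 3)            # heavy-tailed population
  y <- rt(n, df = 3) + shift    # same shape, shifted location
  if (t.test(x, y)$p.value < 0.05)      rej.t <- rej.t + 1
  if (wilcox.test(x, y)$p.value < 0.05) rej.w <- rej.w + 1
}
c(t.power = rej.t / nsims, wrs.power = rej.w / nsims)   # estimated power of each test
```

With heavy tails the WRS power typically comes out ahead; repeating the simulation with rnorm() in place of rt() tends to reverse the ordering.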

3.5 More than two groups

The methods of this section can be extended to more than two groups. The ANOVA F-statistic can be used as the test statistic if using the raw, numeric data. The randomization test based on ranks is known as the Kruskal-Wallis test.
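A minimal sketch of the k-group version, on made-up data: the ANOVA F statistic is permuted by shuffling the group labels, and kruskal.test() gives the rank-based (Kruskal-Wallis) analogue. The data, group sizes, and seed below are illustrative.

```r
set.seed(4123)
y <- c(23, 31, 46, 37, 49, 55, 57, 40, 44, 52)
g <- factor(rep(c("A", "B", "C"), times = c(3, 4, 3)))

F.obs <- anova(lm(y ~ g))$"F value"[1]   # observed ANOVA F statistic
nperms <- 999
F.perm <- numeric(nperms)
for (i in 1:nperms) {
  F.perm[i] <- anova(lm(y ~ sample(g)))$"F value"[1]  # shuffle group labels
}
pval <- (sum(F.perm >= F.obs) + 1) / (nperms + 1)
pval                  # permutation p-value
kruskal.test(y ~ g)   # rank-based analogue
```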


3.6 Randomization tests for contingency tables

Example. Seven patients are included in a study to compare two methods of relieving postoperative pain. Three are allowed to control the amount of pain-relief medicine themselves, while the other four are given a physician prescribed level of medicine according to standard practice. Afterward, the patients evaluate their satisfaction as either “not satisfied (NS)”, “somewhat satisfied (SS)” or “very satisfied (VS)”.

                      NS  SS  VS
Physician prescribed   2   2   0
Self-administered      0   1   2

Is there an association between the method of pain-relief and satisfaction?

Null hypothesis—Method and satisfaction are independent (or, the distributions of patients among the satisfaction categories are identical).
Alternative hypothesis—There is an association between method and satisfaction (or, the distributions of patients among the satisfaction categories are different).

Typically a chi-squared test would be considered to assess these hypotheses, using the test statistic

  X² = Σ_{all cells} (observed − expected)² / expected,

which has an approximate χ²((rows − 1)(cols − 1)) distribution for large sample sizes. However, the approximation can be poor when there are small expected counts.


The randomization test can be viewed as identical to the two-group permutation test, except that the response is a category rather than a quantitative value, i.e.,

Physician prescribed NS1, NS2, SS3, SS4

Self-administered SS5, VS6, VS7

Then a randomization test can be performed by randomly assigning observations to groups, as follows:

2 2 observed exp ected  1. Compute the test statistic: X obs   on the observed data all cells exp ected

2. Randomly assign units to treatments, construct the corresponding table and compute 2 X perm

3. Repeat #2 for all possible random assignments of observations to groups

22 2 2 4. The p-value is PXperm X obs (# X perm values at least as large as Xobs )/(# randomizations).

Example. The expected frequencies are computed below as (row total)*(column total)/n.

table <- matrix( c(2, 2, 0, 0, 1, 2), nrow = 2, byrow = TRUE )
table

## [,1] [,2] [,3]
## [1,] 2 2 0
## [2,] 0 1 2

chitest <‐ chisq.test(as.table(table))

## Warning in chisq.test(as.table(table)): Chi-squared approximation may be
## incorrect

chitest


##
## Pearson's Chi-squared test
##
## data: as.table(table)
## X-squared = 4.2778, df = 2, p-value = 0.1178

chitest$expected

## A B C
## A 1.1428571 1.714286 1.1428571
## B 0.8571429 1.285714 0.8571429

For the observed table, the chi-squared statistic is 4.28.

One particular randomization would be the following

Physician prescribed SS5, NS2, SS3, SS4

Self-administered NS1, VS6, VS7

and the resulting table would be

                      NS  SS  VS
Physician prescribed   1   3   0
Self-administered      1   0   2

The chi-squared statistic for this table is 4.96. This process would be somewhat cumbersome to code. Luckily, R has built-in functions that can do this.
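For completeness, here is one way the by-hand version could look; it permutes the group labels, rebuilds the table, and recomputes the chi-squared statistic (function and variable names are illustrative):

```r
set.seed(4123)
group  <- rep(c("Phys", "Self"), times = c(4, 3))       # 4 physician, 3 self
satisf <- c("NS", "NS", "SS", "SS", "SS", "VS", "VS")   # the 7 responses

chisq.stat <- function(g, s) {
  tab <- table(g, s)
  expctd <- outer(rowSums(tab), colSums(tab)) / sum(tab)  # (row tot)(col tot)/n
  sum((tab - expctd)^2 / expctd)
}

X2.obs <- chisq.stat(group, satisf)   # 4.2778 for the observed table
nperms <- 9999
X2.perm <- numeric(nperms)
for (i in 1:nperms) {
  X2.perm[i] <- chisq.stat(sample(group), satisf)   # shuffle the group labels
}
pval <- (sum(X2.perm >= X2.obs) + 1) / (nperms + 1)
pval
```

The estimate should land near the simulated p-value that chisq.test() reports below, since shuffling labels also holds both table margins fixed.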

Approximate p-value based on a random sample of permutations

table <‐ matrix( c(2, 2, 0, 0, 1, 2), nrow = 2, byrow = TRUE )


chisq.test(as.table(table),simulate.p.value=T, B=10000)

##
## Pearson's Chi-squared test with simulated p-value (based on 10000 replicates)
##
## data: as.table(table)
## X-squared = 4.2778, df = NA, p-value = 0.3247

Exact p-value based on all possible permutations

table <- matrix( c(2, 2, 0, 0, 1, 2), nrow = 2, byrow = TRUE )
fisher.test(as.table(table))

##
## Fisher's Exact Test for Count Data
##
## data: as.table(table)
## p-value = 0.3143
## alternative hypothesis: two.sided

Approximate p-value based on chi-squared distribution

table <- matrix( c(2, 2, 0, 0, 1, 2), nrow = 2, byrow = TRUE )
chisq.test(as.table(table))

## Warning in chisq.test(as.table(table)): Chi-squared approximation may be
## incorrect

##
## Pearson's Chi-squared test
##
## data: as.table(table)
## X-squared = 4.2778, df = 2, p-value = 0.1178


4. Methods for correlation and regression

4.1. Randomization test for linear relation between quantitative variables

Data are either

(a) a random sample of bivariate pairs (X_i, Y_i), or

(b) values of Yi are obtained for fixed values of Xi.

1(a): There is usually no a priori reason to assign either X or Y as independent or dependent. Thus we use correlation as a measure of the strength of the linear relationship.

Population correlation:

  ρ = E[(X − μ_X)(Y − μ_Y)] / (σ_X σ_Y)

Estimate of ρ:

  r = Σ (X_i − X̄)(Y_i − Ȳ) / √( Σ (X_i − X̄)² · Σ (Y_i − Ȳ)² )

To test H₀: ρ = 0:

The test statistic

  t = r √(n − 2) / √(1 − r²) ~ t(n − 2)

can be used if the (X, Y) pairs are a random sample from a bivariate normal population.


1(b): Fit the model Y_i = β₀ + β₁X_i + ε_i, where the ε_i's are iid with mean 0 and variance σ². β₁ is the slope of the regression line.

Estimator of β₁:

  β̂₁ = Σ (X_i − X̄)(Y_i − Ȳ) / Σ (X_i − X̄)²

To test H₀: β₁ = 0, the test statistic is

  t = β̂₁ √( Σ (X_i − X̄)² / MSE ) ~ t(n − 2),

if the ε_i's are normally distributed.

It can be shown that β̂₁ = r (S_Y / S_X), and also that t_corr = t_slope.

4.1.1 Pearson correlation or slope of regression line

A permutation test may be used to obtain an exact p-value, regardless of the form of the distribution of the ε_i's. Under H₀: β₁ = 0 or H₀: ρ = 0, X does not affect the value of Y, so an observed Y is just as likely to occur with any X. Thus, the permutation distribution is derived from all possible assignments of the observed Y's to the observed X's.

The procedure is exactly the same as for comparing groups.

Steps of the permutation test using r or β̂₁ as the test statistic:

1. Compute the test statistic, e.g., r, on the observed data

2. Randomly assign observations (Y's) to treatments (X's), and recompute the test statistic, r_perm


3. Repeat #2 for all possible random assignments of observations to treatments

4. The p-value is P(r_perm ≥ r_obs) = (# r_perm values at least as large as r_obs)/(# permutations).

Example. Lea (1965) discussed the relationship between mean annual temperature and the mortality rate for a type of breast cancer in women. The subjects were residents of certain regions of Great Britain, Norway, and Sweden.

Randomization test for linear association

set.seed(4123)
cancer <- read.table('C:/Users/sjricht2/Documents/DataSets/Regression/Breast Cancer Data.txt', header=T)
cancer

## Mortality Temperature
## 1 102.5 51.3
## 2 104.5 49.9
## 3 100.4 50.0
## 4 95.9 49.2
## 5 87.0 48.5
## 6 95.0 47.8
## 7 88.6 47.3
## 8 89.2 45.1
## 9 78.9 46.3
## 10 84.6 42.1
## 11 81.7 44.2
## 12 72.2 43.5
## 13 65.1 42.3
## 14 68.1 40.2
## 15 67.3 31.8
## 16 52.5 34.0

plot(cancer$Temperature,cancer$Mortality)


r.obs <- cor(cancer$Temperature,cancer$Mortality)
r.obs

## [1] 0.8748544

slope.obs <- lm(cancer$Mortality~cancer$Temperature)$coeff[2]
slope.obs

## cancer$Temperature
## 2.357695

n <- length(cancer$Mortality)
nperms <- 9999  # set number of times to repeat this process
result.r <- numeric(nperms)
result.slope <- numeric(nperms)
for(i in 1:nperms) {
  index <- sample(n, size=n, replace = FALSE)
  result.r[i] <- cor(cancer$Temperature,cancer$Mortality[index])
  result.slope[i] <- lm(cancer$Mortality[index]~cancer$Temperature)$coeff[2]
}
'Permutation test p-value'


## [1] "Permutation test p‐value"

(sum(result.r >= r.obs)+1)/(nperms + 1)

## [1] 1e‐04

(sum(result.slope >= slope.obs)+1)/(nperms + 1)

## [1] 1e‐04

4.1.2. Rank correlation

The randomization test can be carried out on rank transformed data. The X and Y values are ranked separately, and Pearson correlation calculated on the rank-transformed data. Pearson correlation calculated on rank transformed data is called Spearman correlation. The alternative hypothesis for this test is that the rank-transformed data have a linear association, or that the original data have a monotonic (strictly increasing or decreasing) relation.
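The definition can be confirmed directly: cor() with method = "spearman" matches Pearson correlation applied to the ranks (a couple of small vectors without ties, borrowed from the example data):

```r
x <- c(51.3, 49.9, 50.0, 49.2, 48.5, 47.8)
y <- c(102.5, 104.5, 100.4, 95.9, 87.0, 95.0)
cor(x, y, method = "spearman")   # Spearman correlation
cor(rank(x), rank(y))            # identical: Pearson on the ranks
```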

Randomization test for rank correlation

set.seed(4123)
cancer <- read.table('C:/Users/sjricht2/Documents/DataSets/Regression/Breast Cancer Data.txt', header=T)

rank.temp <- rank(cancer$Temperature)
rank.mort <- rank(cancer$Mortality)

plot(rank.temp,rank.mort)


r.obs <- cor(rank.temp, rank.mort)
r.obs

## [1] 0.9029412

n <- length(rank.mort)
nperms <- 9999  # set number of times to repeat this process
result.r <- numeric(nperms)
for (i in 1:nperms) {
  index <- sample(n, size=n, replace = FALSE)
  result.r[i] <- cor(rank.temp, rank.mort[index])
}
'Permutation test p-value'

## [1] "Permutation test p‐value"

(sum(result.r >= r.obs)+1)/(nperms + 1)

## [1] 1e-04


4.2 Bootstrap confidence intervals for correlation and slope

Intervals for the correlation $\rho$

Suppose we have a random sample of ordered pairs $(X_i, Y_i)$, $i = 1, 2, \ldots, n$. If the distribution of $(X_i, Y_i)$ is bivariate normal, then it can be shown that

$$Z = \frac{1}{2}\ln\!\left(\frac{1+r}{1-r}\right) \;\sim\; N\!\left(\frac{1}{2}\ln\!\left(\frac{1+\rho}{1-\rho}\right),\; \frac{1}{n-3}\right) \text{ (approximately)},$$

which can be used to construct an interval estimator for $\rho$. This interval is not robust to departures from normality, however. Notice also that the distribution of $Z$ depends on the population correlation, $\rho$, and thus $Z$ is not a pivot quantity.
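For comparison, the normal-theory interval based on $Z$ can be computed directly. A minimal sketch, plugging in the observed correlation $r = 0.8748544$ and $n = 16$ from the earlier cancer-data output (the last line back-transforms via the inverse of the Fisher $z$ transform):

```r
# Normal-theory (Fisher z) interval for rho, assuming bivariate normality
r <- 0.8748544                       # observed correlation (from earlier output)
n <- 16                              # sample size
z <- 0.5 * log((1 + r) / (1 - r))    # Fisher z transform of r
se <- 1 / sqrt(n - 3)                # approximate standard deviation of Z
z.lim <- z + c(-1, 1) * qnorm(0.975) * se          # 95% limits on the z scale
ci <- (exp(2 * z.lim) - 1) / (exp(2 * z.lim) + 1)  # back-transform to r scale
round(ci, 3)
```

This matches the interval reported by cor.test(), and can be contrasted with the bootstrap percentile interval computed later in this section.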

Bootstrap interval.

1) Draw a specified number of bivariate bootstrap samples of size n, i.e., sample pairs of observations with replacement.

2) Compute $r_{boot}$, the sample correlation, for each bootstrap sample.

3) Use the percentile method to construct a confidence interval. (No pivot quantity exists.)

4.2.1 Bivariate bootstrap sampling

boot.index <- sample(1:nrow(cancer), replace = TRUE)
boot.index

## [1] 8 1 16 13 13 16 1 3 15 3 6 14 7 2 14 12

cancer

## Mortality Temperature
## 1 102.5 51.3
## 2 104.5 49.9
## 3 100.4 50.0
## 4 95.9 49.2
## 5 87.0 48.5
## 6 95.0 47.8
## 7 88.6 47.3
## 8 89.2 45.1
## 9 78.9 46.3
## 10 84.6 42.1
## 11 81.7 44.2
## 12 72.2 43.5
## 13 65.1 42.3
## 14 68.1 40.2
## 15 67.3 31.8
## 16 52.5 34.0

boot.data <- cancer[boot.index,]
boot.data

## Mortality Temperature
## 8 89.2 45.1
## 1 102.5 51.3
## 16 52.5 34.0
## 13 65.1 42.3
## 13.1 65.1 42.3
## 16.1 52.5 34.0
## 1.1 102.5 51.3
## 3 100.4 50.0
## 15 67.3 31.8
## 3.1 100.4 50.0
## 6 95.0 47.8
## 14 68.1 40.2
## 7 88.6 47.3
## 2 104.5 49.9
## 14.1 68.1 40.2
## 12 72.2 43.5

4.2.2 Bootstrap confidence interval for correlation, using bivariate sampling

set.seed(4123)
cancer <- read.table('C:/Users/sjricht2/Documents/DataSets/Regression/Breast Cancer Data.txt', header=T)
nboot <- 10000
r.boot <- numeric(nboot)
slope.boot <- numeric(nboot)
for (i in 1:nboot) {
  boot.index <- sample(1:nrow(cancer), replace = TRUE)
  boot.data <- cancer[boot.index,]
  r.boot[i] <- cor(boot.data$Mortality, boot.data$Temperature)
}

cor(cancer$Mortality,cancer$Temperature)

## [1] 0.8748544 quantile(r.boot,c(0.01,.025,.05,.10,.90,.95,.975,0.99))

## 1% 2.5% 5% 10% 90% 95% 97.5% 99%
## 0.7428583 0.7683708 0.7915637 0.8150605 0.9461222 0.9569523 0.9647395 0.9722582

The observed correlation is 0.875, and a 95% confidence interval is (0.768, 0.965).


Extra.

Bootstrap confidence interval for the slope

Case 1: Data consist of a sample of bivariate pairs $(X_i, Y_i)$.

1) Draw a specified number of bivariate bootstrap samples of size n, i.e., sample pairs of observations.

2) Compute $\hat\beta_{1,b}$, the slope of the regression line, for each bootstrap sample.

3) Use the percentile method to construct a confidence interval.

Since a pivot quantity exists for the sample slope, a t-pivot interval may also be computed.

The statistic $t = \dfrac{\hat\beta_1 - \beta_1}{SE(\hat\beta_1)}$, where $SE(\hat\beta_1) = \sqrt{\dfrac{MSE}{(n-1)S_X^2}}$, is a pivotal quantity.

Thus:

1) Draw a specified number of bivariate bootstrap samples of size n, i.e., sample pairs of observations.

2) Compute $t_b = \dfrac{\hat\beta_{1,b} - \hat\beta_1}{SE(\hat\beta_{1,b})}$, where $SE(\hat\beta_{1,b}) = \sqrt{\dfrac{MSE_b}{(n-1)S_{X,b}^2}}$;

3) For a given confidence level, $1-\alpha$, determine the quantiles $t_{b,\alpha/2}$ and $t_{b,1-\alpha/2}$.

4) The confidence interval is given by:

$$\left(\hat\beta_1 - t_{b,1-\alpha/2}\,SE(\hat\beta_1),\;\; \hat\beta_1 - t_{b,\alpha/2}\,SE(\hat\beta_1)\right).$$
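The steps above can be sketched in R, with the cancer data entered inline (values transcribed from the table earlier in this section; nboot and the 95% level are illustrative choices):

```r
# Bootstrap-t (pivot) interval for the slope, using bivariate resampling
cancer <- data.frame(
  Mortality   = c(102.5, 104.5, 100.4, 95.9, 87.0, 95.0, 88.6, 89.2,
                  78.9, 84.6, 81.7, 72.2, 65.1, 68.1, 67.3, 52.5),
  Temperature = c(51.3, 49.9, 50.0, 49.2, 48.5, 47.8, 47.3, 45.1,
                  46.3, 42.1, 44.2, 43.5, 42.3, 40.2, 31.8, 34.0))

set.seed(4123)
fit <- lm(Mortality ~ Temperature, data = cancer)
b1 <- coef(fit)[2]                           # observed slope
se.b1 <- summary(fit)$coefficients[2, 2]     # its standard error

nboot <- 2000                                # illustrative; larger is better
t.boot <- numeric(nboot)
for (i in 1:nboot) {
  boot.data <- cancer[sample(1:nrow(cancer), replace = TRUE), ]
  fit.b <- lm(Mortality ~ Temperature, data = boot.data)
  t.boot[i] <- (coef(fit.b)[2] - b1) / summary(fit.b)$coefficients[2, 2]
}
q <- quantile(t.boot, c(0.025, 0.975))
c(b1 - q[2] * se.b1, b1 - q[1] * se.b1)      # bootstrap-t interval for slope
```

Note the lower limit uses the upper quantile and vice versa, as in the formula above.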

4.2.3 Fixed-X sampling for the slope

- May be used for inferences on the slope of a regression line.

- Assumes $Y_i = h(X_i) + \epsilon_i$, where $h(X)$ is some function, say a linear function, and the errors $\epsilon_i$ are independent, identically distributed with mean 0 and variance $\sigma^2$.

- Idea: sample the errors and then add them back to the estimated mean.

Steps:

1) Compute $\hat h(X)$ from the observed sample.

2) Compute the errors, $e_i = Y_i - \hat h(X_i)$.

3) Select a bootstrap sample of $n$ errors ($e_{i,b}$) and compute $Y_{i,b} = \hat h(X_i) + e_{i,b}$, $i = 1, \ldots, n$.

4) Repeat (3) $N_b$ times.

This yields $N_b$ bootstrap samples, each consisting of $n$ pairs, $(X_i, Y_{i,b})$.

1) For each bootstrap sample compute

$t_e = \dfrac{\hat\beta_{1,e}}{SE(\hat\beta_{1,e})}$, where $\hat\beta_{1,e}$ is the slope of the pairs $(X_i, e_{i,b})$ for each bootstrap sample, and

$SE(\hat\beta_{1,e}) = \sqrt{\dfrac{MSE_e}{(n-1)S_X^2}}$, with $MSE_e = \dfrac{\sum_i (e_{i,b} - \bar e_b)^2}{n-2}$;

2) Then for a given confidence level, $1-\alpha$, determine the quantiles $t_{e,\alpha/2}$ and $t_{e,1-\alpha/2}$.

3) The confidence interval is given by $\left(\hat\beta_1 - t_{e,1-\alpha/2}\,SE(\hat\beta_1),\;\; \hat\beta_1 - t_{e,\alpha/2}\,SE(\hat\beta_1)\right)$.
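The fixed-X steps can be sketched as follows (the percentile flavor is shown for brevity; data values are transcribed from the cancer table earlier in this section, and nboot is an illustrative choice):

```r
# Fixed-X (residual) bootstrap for the slope: resample errors, keep X fixed
cancer <- data.frame(
  Mortality   = c(102.5, 104.5, 100.4, 95.9, 87.0, 95.0, 88.6, 89.2,
                  78.9, 84.6, 81.7, 72.2, 65.1, 68.1, 67.3, 52.5),
  Temperature = c(51.3, 49.9, 50.0, 49.2, 48.5, 47.8, 47.3, 45.1,
                  46.3, 42.1, 44.2, 43.5, 42.3, 40.2, 31.8, 34.0))

set.seed(4123)
fit <- lm(Mortality ~ Temperature, data = cancer)   # h-hat(X), a linear fit
e <- resid(fit)                                     # e_i = Y_i - h-hat(X_i)
yhat <- fitted(fit)                                 # h-hat(X_i)
n <- nrow(cancer)

nboot <- 2000
slope.boot <- numeric(nboot)
for (i in 1:nboot) {
  e.star <- sample(e, size = n, replace = TRUE)     # bootstrap sample of errors
  y.star <- yhat + e.star                           # Y_{i,b} = h-hat(X_i) + e*
  slope.boot[i] <- coef(lm(y.star ~ cancer$Temperature))[2]
}
quantile(slope.boot, c(0.025, 0.975))   # percentile interval for the slope
```

Only the Y values change across bootstrap samples; the X column stays fixed, which is what distinguishes this from bivariate sampling.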

Fixed-X sampling assumes:

- The correct function is being fit.

- The X-values are constants.

- The error variance is constant.

If these assumptions are questionable, use bivariate sampling. Bivariate sampling will usually result in a larger standard error when the assumptions for fixed-X sampling are violated.

*See example 8.4.1


5. Two-Sample Confidence Intervals

Two cases, as in the parametric setting. Model: $Y_{ij} = \mu_i + \epsilon_{ij}$.

1) $\epsilon_{ij} \sim F$ (common error distribution)

Pivot quantity:

$$t = \frac{(\bar Y_1 - \bar Y_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{(\bar Y_1 - \bar Y_2) - (\mu_1 - \mu_2)}{SE(\bar Y_1 - \bar Y_2)}$$

Bootstrap interval:

i) Compute $e_{ij} = Y_{ij} - \bar Y_i$ for each observation.

ii) Select $n_1$ errors, with replacement, from the set of all errors and assign them to the 1st sample. Then select $n_2$ in similar fashion and assign them to the second sample.

iii) Compute $t_e = \dfrac{\bar e_{1,b} - \bar e_{2,b}}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$, where $s_p^2 = \dfrac{\sum_i \sum_j (e_{ij} - \bar e_i)^2}{n_1 + n_2 - 2}$.

iv) The interval is $\left((\bar Y_1 - \bar Y_2) - t_{e,1-\alpha/2}\,SE(\bar Y_1 - \bar Y_2),\;\; (\bar Y_1 - \bar Y_2) - t_{e,\alpha/2}\,SE(\bar Y_1 - \bar Y_2)\right)$.

2) Unequal error distributions: select errors within samples.

Compute $t_e = \dfrac{\bar e_{1,b} - \bar e_{2,b}}{SE(\bar Y_1 - \bar Y_2)}$, where $SE(\bar Y_1 - \bar Y_2) = \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$ and $s_i^2 = \dfrac{\sum_j (e_{ij} - \bar e_i)^2}{n_i - 1}$.

The interval is $\left((\bar Y_1 - \bar Y_2) - t_{e,1-\alpha/2}\,SE(\bar Y_1 - \bar Y_2),\;\; (\bar Y_1 - \bar Y_2) - t_{e,\alpha/2}\,SE(\bar Y_1 - \bar Y_2)\right)$.
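A sketch of case 1 (common error distribution) in R. The two samples here are simulated placeholders, since this section is not tied to a specific data set; all names and sample sizes are illustrative:

```r
# Two-sample bootstrap-t interval for mu1 - mu2, pooled-error resampling
set.seed(4123)
y1 <- rnorm(12, mean = 25, sd = 4)   # placeholder sample 1
y2 <- rnorm(15, mean = 22, sd = 4)   # placeholder sample 2
n1 <- length(y1); n2 <- length(y2)

diff.obs <- mean(y1) - mean(y2)
sp <- sqrt(((n1 - 1) * var(y1) + (n2 - 1) * var(y2)) / (n1 + n2 - 2))
se.obs <- sp * sqrt(1/n1 + 1/n2)

e <- c(y1 - mean(y1), y2 - mean(y2))  # e_ij = Y_ij - Ybar_i, pooled
nboot <- 5000
t.boot <- numeric(nboot)
for (b in 1:nboot) {
  e1 <- sample(e, n1, replace = TRUE)  # errors assigned to the 1st sample
  e2 <- sample(e, n2, replace = TRUE)  # errors assigned to the 2nd sample
  sp.b <- sqrt(((n1 - 1) * var(e1) + (n2 - 1) * var(e2)) / (n1 + n2 - 2))
  t.boot[b] <- (mean(e1) - mean(e2)) / (sp.b * sqrt(1/n1 + 1/n2))
}
q <- quantile(t.boot, c(0.025, 0.975))
c(diff.obs - q[2] * se.obs, diff.obs - q[1] * se.obs)  # interval for mu1 - mu2
```

For case 2 (unequal error distributions), resample e1 from the first sample's errors only and e2 from the second's, and replace the pooled SE with the unpooled form above.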
