Bootstrap

Nathaniel E. Helwig

Assistant Professor of Psychology and Statistics, University of Minnesota (Twin Cities)

Updated 04-Jan-2017

Copyright

Copyright © 2017 by Nathaniel E. Helwig

Outline of Notes

1) Background Information
   - Statistical inference
   - Sampling distributions
   - Need for resampling

2) Bootstrap Basics
   - Overview
   - Empirical distribution
   - Plug-in principle

3) Bootstrap in Practice
   - Bootstrap in R
   - Bias and mean-squared error
   - The Jackknife

4) Bootstrapping Regression
   - Regression review
   - Bootstrapping residuals
   - Bootstrapping pairs

For a thorough treatment see: Efron, B. & Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman & Hall/CRC.


Background Information

The Classic Statistical Paradigm

X is some variable of interest, e.g., age in years

X = {x1, x2, x3, ...} is some population of interest, e.g.,
- ages of all students at the University of Minnesota
- ages of all people in the state of Minnesota

At the population level:
- F(x) = P(X ≤ x) for all x ∈ X is the population CDF
- θ = t(F) is the population parameter, where t is some function of F

At the sample level:
- x = (x1, ..., xn)′ is a sample of data with $x_i \stackrel{iid}{\sim} F$ for i ∈ {1, ..., n}
- θ̂ = s(x) is the sample statistic, where s is some function of x

The Classic Statistical Paradigm (continued)

θ̂ is a random variable that depends on x (and thus F).

The sampling distribution of θ̂ refers to the CDF (or PDF) of θ̂.

If F is known (or assumed to be known), then the sampling distribution of θ̂ may have some known distribution.
- If $x_i \stackrel{iid}{\sim} N(\mu, \sigma^2)$, then $\bar{x} \sim N(\mu, \sigma^2/n)$, where $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$
- Note that in the above example, θ ≡ µ and θ̂ ≡ x̄

How can we make inferences about θ using θ̂ when F is unknown?

The Hypothetical Ideal

Assume that X is too large to measure all members of the population.

If we had a really LARGE research budget, we could collect B independent samples from the population X:
- xj = (x1j, ..., xnj)′ is the j-th sample with $x_{ij} \stackrel{iid}{\sim} F$
- θ̂j = s(xj) is the statistic (parameter estimate) for the j-th sample

The sampling distribution of θ̂ can be estimated via the distribution of $\{\hat\theta_j\}_{j=1}^B$.

The Hypothetical Ideal: Example 1 (Normal Mean)

Sampling distribution of x̄ with $x_i \stackrel{iid}{\sim} N(0, 1)$ for n = 100:

[Figure: Six histograms of x̄ for B = 200, 500, 1000, 2000, 5000, and 10000 simulated samples, each overlaid with the theoretical N(0, 1/n) pdf of x̄; x-axis: xbar, y-axis: Density.]

The Hypothetical Ideal: Example 1 R Code

# hypothetical ideal: example 1 (normal mean)
set.seed(1)
n = 100
B = c(200, 500, 1000, 2000, 5000, 10000)
xseq = seq(-0.4, 0.4, length = 200)
quartz(width = 12, height = 8)
par(mfrow = c(2, 3))
for (k in 1:6) {
  X = replicate(B[k], rnorm(n))
  xbar = apply(X, 2, mean)
  hist(xbar, freq = F, xlim = c(-0.4, 0.4), ylim = c(0, 5),
       main = paste("Sampling Distribution: B =", B[k]))
  lines(xseq, dnorm(xseq, sd = 1/sqrt(n)))
  legend("topright", expression(bar(x)*" pdf"), lty = 1, bty = "n")
}

The Hypothetical Ideal: Example 2 (Normal Median)

Sampling distribution of median(x) with $x_i \stackrel{iid}{\sim} N(0, 1)$ for n = 100:

[Figure: Six histograms of median(x) for B = 200, 500, 1000, 2000, 5000, and 10000 simulated samples, each overlaid with the N(0, 1/n) pdf of x̄ for reference; x-axis: xmed, y-axis: Density.]

The Hypothetical Ideal: Example 2 R Code

# hypothetical ideal: example 2 (normal median)
set.seed(1)
n = 100
B = c(200, 500, 1000, 2000, 5000, 10000)
xseq = seq(-0.4, 0.4, length = 200)
quartz(width = 12, height = 8)
par(mfrow = c(2, 3))
for (k in 1:6) {
  X = replicate(B[k], rnorm(n))
  xmed = apply(X, 2, median)
  hist(xmed, freq = F, xlim = c(-0.4, 0.4), ylim = c(0, 5),
       main = paste("Sampling Distribution: B =", B[k]))
  lines(xseq, dnorm(xseq, sd = 1/sqrt(n)))
  legend("topright", expression(bar(x)*" pdf"), lty = 1, bty = "n")
}

Back to the Real World

In most cases, we only have one sample of data. What do we do?

If n is large and we only care about x̄, we can use the CLT.

Sampling distribution of x̄ with $x_i \stackrel{iid}{\sim} U[0, 1]$ for B = 10000:

[Figure: Six histograms of x̄ for n = 3, 5, 10, 20, 50, and 100 (B = 10000 each), each overlaid with the asymptotic normal pdf implied by the CLT; x-axis: xbar, y-axis: Density. The normal approximation improves as n increases.]

The Need for a Nonparametric Resampling Method

For most statistics other than the sample mean, there is no theoretical argument to derive the sampling distribution.

To make inferences, we need to somehow obtain (or approximate) the sampling distribution of any generic statistic θ̂.
- Parametric approaches overcome this issue by assuming some particular distribution for the data.
- The nonparametric bootstrap overcomes this problem by resampling the observed data to approximate the sampling distribution of θ̂.


Bootstrap Basics

Problem of Interest

In statistics, we typically want to know the properties of our estimates, e.g., precision, accuracy, etc.

In the parametric situation, we can often derive the distribution of our estimate given our assumptions about the data (or via MLE principles).

In the nonparametric situation, we can use the bootstrap to examine properties of our estimates in a variety of different situations.

Bootstrap Procedure

Suppose x = (x1, ..., xn)′ with $x_i \stackrel{iid}{\sim} F(x)$ for i ∈ {1, ..., n}, and we want to make inferences about some statistic θ̂ = s(x).

We can use the Monte Carlo bootstrap:

1. Sample $x_i^*$ with replacement from {x1, ..., xn} for i ∈ {1, ..., n}
2. Calculate $\hat\theta^* = s(x^*)$ for the b-th sample, where $x^* = (x_1^*, \ldots, x_n^*)'$
3. Repeat steps 1–2 a total of B times to get the bootstrap distribution of θ̂
4. Compare θ̂ = s(x) to the bootstrap distribution
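As a minimal sketch of these four steps in base R (the statistic, sample size, and number of bootstrap samples below are illustrative choices, not from the notes):

# Monte Carlo bootstrap for the sample median (illustrative sketch)
set.seed(1)
x = rnorm(100)                        # observed sample
B = 2000                              # number of bootstrap samples
thetahat = median(x)                  # statistic computed from the observed data
thetastar = replicate(B, {
  xstar = sample(x, replace = TRUE)   # step 1: resample with replacement
  median(xstar)                       # step 2: bootstrap replication
})                                    # step 3: repeat B times
c(thetahat, mean(thetastar), sd(thetastar))  # step 4: compare to bootstrap distribution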

The estimated standard error of θ̂ is the standard deviation of $\{\hat\theta^*_b\}_{b=1}^B$:

$\hat\sigma_B = \sqrt{\frac{1}{B - 1} \sum_{b=1}^B (\hat\theta^*_b - \bar\theta^*)^2}$

where $\bar\theta^* = \frac{1}{B}\sum_{b=1}^B \hat\theta^*_b$ is the mean of the bootstrap distribution of θ̂.

Empirical Cumulative Distribution Functions

Suppose x = (x1, ..., xn)′ with $x_i \stackrel{iid}{\sim} F(x)$ for i ∈ {1, ..., n}, and we want to estimate the cdf F.

The empirical cumulative distribution function (ecdf) F̂n is defined as

$\hat{F}_n(x) = \hat{P}(X \le x) = \frac{1}{n}\sum_{i=1}^n I_{\{x_i \le x\}}$

where I{·} denotes an indicator function.

The ecdf assigns probability 1/n to each value xi, which implies that

$\hat{P}_n(A) = \frac{1}{n}\sum_{i=1}^n I_{\{x_i \in A\}}$

for any set A in the sample space of X.
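As a quick illustrative check of this definition against R's built-in ecdf function (the toy data are arbitrary):

# the ecdf by hand versus R's built-in ecdf()
x = c(3, 1, 4, 1, 5)
Fn = function(t) mean(x <= t)   # proportion of sample values <= t
Fn(3)                           # 0.6
ecdf(x)(3)                      # 0.6 as well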

Some Properties of ECDFs

For any fixed value x, we have

$E[\hat{F}_n(x)] = F(x)$
$V[\hat{F}_n(x)] = \frac{1}{n} F(x)[1 - F(x)]$

(both follow because $n \hat{F}_n(x) \sim \mathrm{Binomial}(n, F(x))$).

As n → ∞, we have

$\sup_{x \in \mathbb{R}} |\hat{F}_n(x) - F(x)| \stackrel{a.s.}{\to} 0$

which is the Glivenko–Cantelli theorem.

ECDF Visualization for Normal Data

[Figure: ECDF plots of standard normal samples with n = 100, 500, and 1000, each overlaid (in blue) with the true normal CDF; x-axis: x, y-axis: Fn(x).]

set.seed(1)
par(mfrow = c(1, 3))
n = c(100, 500, 1000)
xseq = seq(-4, 4, length = 100)
for (j in 1:3) {
  x = rnorm(n[j])
  plot(ecdf(x), main = paste("n = ", n[j]))
  lines(xseq, pnorm(xseq), col = "blue")
}

ECDF Example

Table 3.1 of An Introduction to the Bootstrap (Efron & Tibshirani, 1993): the law school data.

School   LSAT (y)   GPA (z)
   1       576       3.39
   2       635       3.30
   3       558       2.81
   4       578       3.03
   5       666       3.44
   6       580       3.07
   7       555       3.00
   8       661       3.43
   9       651       3.36
  10       605       3.13
  11       653       3.12
  12       575       2.74
  13       545       2.76
  14       572       2.88
  15       594       2.96

Defining A = {(y, z) : 0 < y < 600, 0 < z < 3.00}, we have

$\hat{P}_{15}(A) = \frac{1}{15}\sum_{i=1}^{15} I_{\{(y_i, z_i) \in A\}} = 5/15$
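A quick sketch of this calculation in R, entering the table's values directly:

# plug-in probability of A = {(y, z): 0 < y < 600, 0 < z < 3.00}
lsat = c(576, 635, 558, 578, 666, 580, 555, 661, 651, 605, 653, 575, 545, 572, 594)
gpa = c(3.39, 3.30, 2.81, 3.03, 3.44, 3.07, 3.00, 3.43, 3.36, 3.13, 3.12, 2.74, 2.76, 2.88, 2.96)
mean(lsat < 600 & gpa < 3.00)   # 5/15 = 0.3333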

Plug-In Parameter Estimates

Suppose x = (x1, ..., xn)′ with $x_i \stackrel{iid}{\sim} F(x)$ for i ∈ {1, ..., n}, and we want to estimate some parameter θ = t(F) that depends on the cdf F. Example: we want to estimate $E(X) = \int x f(x)\,dx$.

The plug-in estimate of θ = t(F) is given by

θ̂ = t(F̂)

which is the statistic calculated using the ecdf in place of the cdf.

Plug-In Estimate of Mean

Suppose x = (x1, ..., xn)′ with $x_i \stackrel{iid}{\sim} F(x)$ for i ∈ {1, ..., n}, and we want to estimate the expected value $\theta = E(X) = \int x f(x)\,dx$.

The plug-in estimate of the expected value is the sample mean

$\hat\theta = E_{\hat{F}}(x) = \sum_{i=1}^n x_i \hat{f}_i = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}$

where $\hat{f}_i = 1/n$ is the sample probability from the ecdf.

Standard Error of Mean

Let $\mu_F = E_F(x)$ and $\sigma_F^2 = V_F(x) = E_F[(x - \mu_F)^2]$ denote the mean and variance of X, and denote this using the notation $X \sim (\mu_F, \sigma_F^2)$.

If x = (x1, ..., xn)′ with $x_i \stackrel{iid}{\sim} F(x)$ for i ∈ {1, ..., n}, then the sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ has mean and variance $\bar{x} \sim (\mu_F, \sigma_F^2/n)$.

The standard error of the mean x̄ is the square root of the variance of x̄:

$SE_F(\bar{x}) = \sigma_F / \sqrt{n}$

Plug-In Estimate of Standard Error of Mean

Suppose x = (x1, ..., xn)′ with $x_i \stackrel{iid}{\sim} F(x)$ for i ∈ {1, ..., n}, and we want to estimate the standard error of the mean $SE_F(\bar{x}) = \sigma_F/\sqrt{n} = \sqrt{E_F[(x - \mu_F)^2]/n}$.

The plug-in estimate of the standard deviation is given by

$\hat\sigma = \sigma_{\hat{F}} = \left[\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2\right]^{1/2}$

so the plug-in estimate of the standard error of the mean is

$\hat\sigma/\sqrt{n} = \sigma_{\hat{F}}/\sqrt{n} = \left[\frac{1}{n^2}\sum_{i=1}^n (x_i - \bar{x})^2\right]^{1/2}$
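A small sketch contrasting this plug-in estimate (which divides by n) with the usual estimate sd(x)/sqrt(n) (which divides by n − 1); the data here are arbitrary:

# plug-in standard error of the mean versus the usual estimate
set.seed(1)
x = rnorm(50)
n = length(x)
sqrt(sum((x - mean(x))^2)/n^2)   # plug-in estimate: divides by n
sd(x)/sqrt(n)                    # usual estimate: divides by n - 1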


Bootstrap in Practice

Bootstrap Standard Error (revisited)

Suppose x = (x1, ..., xn)′ with $x_i \stackrel{iid}{\sim} F(x)$ for i ∈ {1, ..., n}, and we want to make inferences about some statistic θ̂ = s(x).

The estimated standard error of θ̂ is the standard deviation of $\{\hat\theta^*_b\}_{b=1}^B$:

$\hat\sigma_B = \sqrt{\frac{1}{B - 1} \sum_{b=1}^B (\hat\theta^*_b - \bar\theta^*)^2}$

where $\bar\theta^* = \frac{1}{B}\sum_{b=1}^B \hat\theta^*_b$ is the mean of the bootstrap distribution of θ̂.

As the number of bootstrap samples goes to infinity, we have

$\lim_{B \to \infty} \hat\sigma_B = SE_{\hat{F}}(\hat\theta)$

where $SE_{\hat{F}}(\hat\theta)$ is the plug-in estimate of $SE_F(\hat\theta)$.
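A small self-contained sketch illustrating this convergence (the statistic and the B values are arbitrary choices):

# bootstrap SE of the median stabilizes as B grows (illustrative)
set.seed(1)
x = rnorm(100)
for (B in c(50, 500, 5000, 50000)) {
  thetastar = replicate(B, median(sample(x, replace = TRUE)))
  cat("B =", B, " bootstrap SE =", sd(thetastar), "\n")
}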

Illustration of Bootstrap Standard Error

[Figure 6.1 from An Introduction to the Bootstrap (Efron & Tibshirani, 1993): schematic of the bootstrap algorithm for estimating the standard error of a statistic θ̂ = s(x). Each bootstrap sample is an independent random sample of size n from F̂; each yields a bootstrap replication of θ̂, and the standard deviation of the B replications estimates the standard error. B is usually between 25 and 200, and as B → ∞ the estimate approaches the plug-in estimate of SE_F̂(θ̂).]

Illustration of Bootstrap Procedure

[Figure 8.1 from An Introduction to the Bootstrap (Efron & Tibshirani, 1993): schematic diagram of the bootstrap for one-sample problems. In the real world, the unknown probability distribution F gives the data x = (x1, ..., xn)′ by random sampling, and from x we calculate the statistic of interest θ̂ = s(x). In the bootstrap world, the empirical distribution F̂ generates x* by random sampling, giving θ̂* = s(x*). There is only one observed value of θ̂, but we can generate as many bootstrap replications θ̂* as we can afford; the crucial step is constructing the estimate F̂ of the unknown population F from x.]

An R Function for Bootstrap Resampling

We can design our own bootstrap sampling function:

bootsamp <- function(x, nsamp = 10000) {
  x = as.matrix(x)
  nx = nrow(x)
  bsamp = replicate(nsamp, x[sample.int(nx, replace = TRUE), ])
}

If x is a vector of length n, then bootsamp returns an n × B matrix, where B is the number of bootstrap samples (controlled via nsamp).

If x is a matrix of order n × p, then bootsamp returns an n × p × B array, where B is the number of bootstrap samples.
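A quick sketch verifying these dimensions (toy sizes chosen for readability):

# check bootsamp return dimensions for vector and matrix input
set.seed(1)
dim(bootsamp(rnorm(10), nsamp = 5))                  # 10 x 5 matrix
dim(bootsamp(matrix(rnorm(20), 10, 2), nsamp = 5))   # 10 x 2 x 5 array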

An R Function for Bootstrap Standard Error

We can design our own bootstrap standard error function:

bootse <- function(bsamp, myfun, ...) {
  if (is.matrix(bsamp)) {
    theta = apply(bsamp, 2, myfun, ...)
  } else {
    theta = apply(bsamp, 3, myfun, ...)
  }
  if (is.matrix(theta)) {
    return(list(theta = theta, cov = cov(t(theta))))
  } else {
    return(list(theta = theta, se = sd(theta)))
  }
}

Returns a list where theta contains the bootstrap statistics $\{\hat\theta^*_b\}_{b=1}^B$, and se contains the bootstrap standard error estimate (or cov contains the bootstrap covariance matrix).

Example 1: Sample Mean

> set.seed(1)
> x = rnorm(500, mean = 1)
> bsamp = bootsamp(x)
> bse = bootse(bsamp, mean)
> mean(x)
[1] 1.022644
> sd(x)/sqrt(500)
[1] 0.04525481
> bse$se
[1] 0.04530694
> hist(bse$theta)

[Figure: Histogram of bse$theta, roughly bell-shaped and centered near 1.02.]

Example 2: Sample Median

> set.seed(1)
> x = rnorm(500, mean = 1)
> bsamp = bootsamp(x)
> bse = bootse(bsamp, median)
> median(x)
[1] 0.9632217
> bse$se
[1] 0.04299574
> hist(bse$theta)

[Figure: Histogram of bse$theta, centered near 0.96.]

Example 3: Sample Variance

> set.seed(1)
> x = rnorm(500, sd = 2)
> bsamp = bootsamp(x)
> bse = bootse(bsamp, var)
> var(x)
[1] 4.095996
> bse$se
[1] 0.2690615
> hist(bse$theta)

[Figure: Histogram of bse$theta, centered near 4.1.]

Example 4: Mean Difference

> set.seed(1)
> x = rnorm(500, mean = 3)
> y = rnorm(500)
> z = cbind(x, y)
> bsamp = bootsamp(z)
> myfun = function(z) mean(z[,1]) - mean(z[,2])
> bse = bootse(bsamp, myfun)
> myfun(z)
[1] 3.068584
> sqrt( (var(z[,1]) + var(z[,2]))/nrow(z) )
[1] 0.06545061
> bse$se
[1] 0.06765369
> hist(bse$theta)

[Figure: Histogram of bse$theta, centered near 3.07.]

Example 5: Median Difference

> set.seed(1)
> x = rnorm(500, mean = 3)
> y = rnorm(500)
> z = cbind(x, y)
> bsamp = bootsamp(z)
> myfun = function(z) median(z[,1]) - median(z[,2])
> bse = bootse(bsamp, myfun)
> myfun(z)
[1] 2.984479
> bse$se
[1] 0.07699423
> hist(bse$theta)

[Figure: Histogram of bse$theta, centered near 2.98.]

Example 6: Correlation Coefficient

> set.seed(1)
> x = rnorm(500)
> y = rnorm(500)
> Amat = matrix(c(1, -0.25, -0.25, 1), 2, 2)
> Aeig = eigen(Amat, symmetric = TRUE)
> evec = Aeig$vec
> evalsqrt = diag(Aeig$val^0.5)
> Asqrt = evec %*% evalsqrt %*% t(evec)
> z = cbind(x, y) %*% Asqrt
> bsamp = bootsamp(z)
> myfun = function(z) cor(z[,1], z[,2])
> bse = bootse(bsamp, myfun)
> myfun(z)
[1] -0.2884766
> (1 - myfun(z)^2)/sqrt(nrow(z) - 3)
[1] 0.04112326
> bse$se
[1] 0.03959024
> hist(bse$theta)

[Figure: Histogram of bse$theta, centered near -0.29.]

Example 7: Uniform[0, θ] (a bootstrap failure)

> set.seed(1)
> x = runif(500)
> bsamp = bootsamp(x)
> myfun = function(x) max(x)
> bse = bootse(bsamp, myfun)
> myfun(x)
[1] 0.9960774
> bse$se
[1] 0.001472801
> hist(bse$theta)

[Figure: Histogram of bse$theta, piled up at and just below max(x).]

F̂ is not a good estimate of F in the extreme tails, so the bootstrap performs poorly for extreme order statistics such as the maximum.
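One way to see the failure: a bootstrap sample reproduces the observed maximum exactly whenever max(x) is drawn at least once, which happens with probability $1 - (1 - 1/n)^n \approx 1 - e^{-1}$. A one-line check:

# probability that a bootstrap sample of size n = 500 contains max(x)
1 - (1 - 1/500)^500   # about 0.632, a large atom at max(x) in the bootstrap distribution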

Measuring the Quality of an Estimator

We have focused on standard error to measure the precision of θ̂. Small standard error is good, but other qualities are important too!

Consider the following toy example:
- Suppose $\{x_i\}_{i=1}^n \stackrel{iid}{\sim} (\mu, \sigma^2)$ and we want to estimate the mean of X
- Define µ̂ = 10 + x̄ to be our estimate of µ
- The standard error of µ̂ is $\sigma/\sqrt{n}$, and $\lim_{n \to \infty} \sigma/\sqrt{n} = 0$
- But µ̂ = 10 + x̄ is clearly not an ideal estimate of µ

Bias

Suppose x = (x1, ..., xn)′ with $x_i \stackrel{iid}{\sim} F(x)$ for i ∈ {1, ..., n}, and we want to make inferences about some statistic θ̂ = s(x).

The bias of an estimate θ̂ = s(x) of θ = t(F) is defined as

$\mathrm{Bias}_F = E_F[s(x)] - t(F)$

where the expectation is taken with respect to F.
- Bias is the difference between the expectation of the estimate and the parameter.
- In the example on the previous slide, we have

$\mathrm{Bias}_F = E_F(\hat\mu) - \mu = E_F(10 + \bar{x}) - \mu = 10 + E_F(\bar{x}) - \mu = 10$

Bootstrap Estimate of Bias

The bootstrap estimate of bias substitutes F̂ for F in the bias definition:

$\mathrm{Bias}_{\hat{F}} = E_{\hat{F}}[s(x^*)] - t(\hat{F})$

where the expectation is taken with respect to the ecdf F̂.
- Note that t(F̂) is the plug-in estimate of θ.
- t(F̂) is not necessarily equal to θ̂ = s(x).

Given B bootstrap samples, we can estimate bias using

$\widehat{\mathrm{Bias}}_B = \bar\theta^* - t(\hat{F})$

where $\bar\theta^* = \frac{1}{B}\sum_{b=1}^B \hat\theta^*_b$ is the mean of the bootstrap distribution of θ̂.

An R Function for Bootstrap Bias

We can design our own bootstrap bias estimation function:

bootbias <- function(bse, theta, ...) {
  if (is.matrix(bse$theta)) {
    return(apply(bse$theta, 1, mean) - theta)
  } else {
    return(mean(bse$theta) - theta)
  }
}

The first input bse is the object output from bootse, and the second input theta is the plug-in estimate of theta used for bias calculation.

Sample Mean is Unbiased

> set.seed(1)
> x = rnorm(500, mean = 1)
> bsamp = bootsamp(x)
> bse = bootse(bsamp, mean)
> mybias = bootbias(bse, mean(x))
> mybias
[1] 0.0003689287
> mean(x)
[1] 1.022644
> sd(x)/sqrt(500)
[1] 0.04525481
> bse$se
[1] 0.04530694

Toy Example

> set.seed(1)
> x = rnorm(500, mean = 1)
> bsamp = bootsamp(x)
> bse = bootse(bsamp, function(x) mean(x) + 10)
> mybias = bootbias(bse, mean(x))
> mybias
[1] 10.00037
> mean(x)
[1] 1.022644
> sd(x)/sqrt(500)
[1] 0.04525481
> bse$se
[1] 0.04530694

Mean-Squared Error (MSE)

The mean-squared error (MSE) of an estimate θ̂ = s(x) of θ = t(F) is

$\mathrm{MSE}_F = E_F\{[s(x) - t(F)]^2\} = V_F(\hat\theta) + [\mathrm{Bias}_F(\hat\theta)]^2$

where the expectation is taken with respect to F.
- MSE is the expected squared difference between θ̂ and θ.
- In the toy example on the previous slide, we have

$\mathrm{MSE}_F = E_F\{(\hat\mu - \mu)^2\} = E_F\{(10 + \bar{x} - \mu)^2\} = E_F(10^2) + 2 E_F[10(\bar{x} - \mu)] + E_F[(\bar{x} - \mu)^2] = 100 + \sigma^2/n$

Toy Example (revisited)

> set.seed(1)
> x = rnorm(500, mean = 1)
> bsamp = bootsamp(x)
> bse = bootse(bsamp, function(x) mean(x) + 10)
> mybias = bootbias(bse, mean(x))
> c(bse$se, mybias)
[1] 0.04530694 10.00036893
> c(bse$se, mybias)^2
[1] 2.052718e-03 1.000074e+02
> mse = (bse$se^2) + (mybias^2)
> mse
[1] 100.0094
> 100 + 1/length(x)
[1] 100.002

Balance between Bias and Variance

MSE quantifies both the accuracy (bias) and precision (variance).

Ideal estimators are accurate (small bias) and precise (small variance).

Having some bias can be an ok (or even good) thing, despite the negative connotations of the word “biased”.

For example:
Q: Would you rather have an estimator that is biased by 1 unit with a standard error of 1 unit, or one that is unbiased but has a standard error of 1.5 units?
A: The first estimator is better with respect to MSE: 1² + 1² = 2 versus 0² + 1.5² = 2.25.

Accuracy and Precision Visualization

[Figure: Four scatterplots of estimates (points) around a true target value, showing the four combinations of low/high accuracy with low/high precision; legend: Truth, Estimates.]

Jackknife Sample

Before the bootstrap, the jackknife was used to estimate bias and SE.

The i-th jackknife sample of x = (x1,..., xn) is defined as

$x_{(i)} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$

for i ∈ {1, ..., n}. Note that:

- $x_{(i)}$ is the original data vector without the i-th observation
- $x_{(i)}$ is a vector of length n − 1

Jackknife Replication

The i-th jackknife replication $\hat\theta_{(i)}$ of the statistic θ̂ = s(x) is

$\hat\theta_{(i)} = s(x_{(i)})$

which is the statistic calculated using the i-th jackknife sample.

For plug-in statistics θ̂ = t(F̂), we have

$\hat\theta_{(i)} = t(\hat{F}_{(i)})$

where $\hat{F}_{(i)}$ is the empirical distribution of $x_{(i)}$.

Jackknife Estimate of Standard Error

The jackknife estimate of the standard error is defined as

$\hat\sigma_{jack} = \sqrt{\frac{n-1}{n} \sum_{i=1}^n (\hat\theta_{(i)} - \hat\theta_{(\cdot)})^2}$

where $\hat\theta_{(\cdot)} = \frac{1}{n}\sum_{i=1}^n \hat\theta_{(i)}$ is the mean of the jackknife estimates of θ̂.

Note that the (n − 1)/n factor is derived by considering the special case θ̂ = x̄:

$\hat\sigma_{jack} = \sqrt{\frac{1}{(n-1)\,n} \sum_{i=1}^n (x_i - \bar{x})^2}$

which is an unbiased estimator of the standard error of x̄.

Jackknife Estimate of Bias

The jackknife estimate of bias is defined as

$\widehat{\mathrm{Bias}}_{jack} = (n - 1)(\hat\theta_{(\cdot)} - \hat\theta)$

where $\hat\theta_{(\cdot)} = \frac{1}{n}\sum_{i=1}^n \hat\theta_{(i)}$ is the mean of the jackknife estimates of θ̂.

This approach only works for plug-in statistics θ̂ = t(F̂).
- Only works if t(F̂) is smooth (e.g., mean or ratio)
- Doesn't work if t(F̂) is unsmooth (e.g., median)
- Gives a bias estimate using only n recomputations (typically n ≪ B)

Smooth versus Unsmooth t(F̂)

Suppose we have a sample of data (x1,..., xn), and consider the mean and median as a function of x1:

[Figure: The sample mean (left panel) and sample median (right panel) of (x1, ..., xn) plotted as functions of x1; the mean varies smoothly in x1, while the median is a step function of x1.]

Smooth versus Unsmooth t(F̂): R Code

# mean is smooth
meanfun <- function(x, z) mean(c(x, z))
set.seed(1)
z = rnorm(100)
x = seq(-4, 4, length = 200)
meanval = rep(0, 200)
for (j in 1:200) meanval[j] = meanfun(x[j], z)
quartz(width = 6, height = 6)
plot(x, meanval, xlab = expression(x[1]), main = "mean")

# median is unsmooth
medfun <- function(x, z) median(c(x, z))
set.seed(1)
z = rnorm(100)
x = seq(-4, 4, length = 200)
medval = rep(0, 200)
for (j in 1:200) medval[j] = medfun(x[j], z)
quartz(width = 6, height = 6)
plot(x, medval, xlab = expression(x[1]), main = "median")

Some R Functions for the Jackknife

We can design our own jackknife functions:

jacksamp <- function(x) {
  nx = length(x)
  jsamp = matrix(0, nx - 1, nx)
  for (j in 1:nx) jsamp[,j] = x[-j]
  jsamp
}

jackse <- function(jsamp, myfun, ...) {
  nx = ncol(jsamp)
  theta = apply(jsamp, 2, myfun, ...)
  se = sqrt( ((nx-1)/nx) * sum( (theta - mean(theta))^2 ) )
  list(theta = theta, se = se)
}

These functions work similarly to the bootsamp and bootse functions when x is a vector and the statistic produced by myfun is unidimensional.
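The notes do not include a jackknife bias function, but a minimal sketch following the bias formula from the previous slides might look like this (jackbias is a hypothetical name, not from the notes):

# jackknife bias: (n - 1) * (mean of jackknife replications - plug-in estimate)
jackbias <- function(jsamp, myfun, theta, ...) {
  nx = ncol(jsamp)                          # number of jackknife samples (n)
  thetajack = apply(jsamp, 2, myfun, ...)   # jackknife replications
  (nx - 1) * (mean(thetajack) - theta)
}
# e.g., jackbias(jacksamp(x), mean, mean(x)) is exactly 0 for the sample mean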

Example 1: Sample Mean (revisited)

> set.seed(1)
> x = rnorm(500, mean = 1)
> jsamp = jacksamp(x)
> jse = jackse(jsamp, mean)
> mean(x)
[1] 1.022644
> sd(x)/sqrt(500)
[1] 0.04525481
> jse$se
[1] 0.04525481
> hist(jse$theta)

[Figure: Histogram of jse$theta, roughly bell-shaped around 1.022.]

Example 2: Sample Median (revisited)

> set.seed(1)
> x = rnorm(500, mean = 1)
> jsamp = jacksamp(x)
> jse = jackse(jsamp, median)
> median(x)
[1] 0.9632217
> jse$se
[1] 0.01911879
> hist(jse$theta)

[Figure: Histogram of jse$theta; the jackknife replications of the median take only a few distinct values.]

Note that this jackknife estimate (0.0191) is far from the bootstrap estimate obtained earlier (0.0430), illustrating the jackknife's unreliability for unsmooth statistics such as the median.


Bootstrapping Regression

Simple Linear Regression Model: Scalar Form

The simple linear regression model has the form

$y_i = b_0 + b_1 x_i + e_i$

for i ∈ {1,..., n} where

- $y_i \in \mathbb{R}$ is the real-valued response for the i-th observation
- $b_0 \in \mathbb{R}$ is the regression intercept
- $b_1 \in \mathbb{R}$ is the regression slope
- $x_i \in \mathbb{R}$ is the predictor for the i-th observation
- $e_i \stackrel{iid}{\sim} (0, \sigma^2)$ is zero-mean measurement error

This implies that $(y_i | x_i) \stackrel{ind}{\sim} (b_0 + b_1 x_i, \sigma^2)$.

Simple Linear Regression Model: Matrix Form

The simple linear regression model has the form

y = Xb + e

where
- $y = (y_1, \ldots, y_n)' \in \mathbb{R}^n$ is the n × 1 response vector
- $X = [1_n, x] \in \mathbb{R}^{n \times 2}$ is the n × 2 design matrix
  - $1_n$ is an n × 1 vector of ones
  - $x = (x_1, \ldots, x_n)' \in \mathbb{R}^n$ is the n × 1 predictor vector
- $b = (b_0, b_1)' \in \mathbb{R}^2$ is the 2 × 1 vector of regression coefficients
- $e = (e_1, \ldots, e_n)' \sim (0_n, \sigma^2 I_n)$ is the n × 1 error vector

This implies that $(y | x) \sim (Xb, \sigma^2 I_n)$.

Ordinary Least Squares: Scalar Form

The ordinary least squares (OLS) problem is

$\min_{b_0, b_1 \in \mathbb{R}} \sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2$

and the OLS solution has the form

$\hat{b}_0 = \bar{y} - \hat{b}_1 \bar{x}$
$\hat{b}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$

where $\bar{x} = (1/n)\sum_{i=1}^n x_i$ and $\bar{y} = (1/n)\sum_{i=1}^n y_i$.

Ordinary Least Squares: Matrix Form

The ordinary least squares (OLS) problem is

$\min_{b \in \mathbb{R}^2} \|y - Xb\|^2$

where ‖·‖ denotes the Frobenius norm; the OLS solution has the form

$\hat{b} = (X'X)^{-1} X'y$

where

$(X'X)^{-1} = \frac{1}{n \sum_{i=1}^n (x_i - \bar{x})^2} \begin{pmatrix} \sum_{i=1}^n x_i^2 & -\sum_{i=1}^n x_i \\ -\sum_{i=1}^n x_i & n \end{pmatrix}$

$X'y = \begin{pmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n x_i y_i \end{pmatrix}$
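A quick sketch verifying the matrix formula against R's lm function (the simulated data are arbitrary):

# compute (X'X)^{-1} X'y directly and compare with lm()
set.seed(1)
x = rexp(100)
y = 3 + 2*x + rnorm(100)
X = cbind(1, x)
solve(crossprod(X), crossprod(X, y))   # matrix-form OLS solution
coef(lm(y ~ x))                        # same coefficients from lm()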

OLS Coefficients are Random Variables

Note that b̂ is a linear function of y, so we can derive the following.

The expectation of b̂ is given by

$E(\hat{b}) = E[(X'X)^{-1}X'y] = E[(X'X)^{-1}X'(Xb + e)] = E[b] + (X'X)^{-1}X'E[e] = b$

and the covariance matrix is given by

$V(\hat{b}) = V[(X'X)^{-1}X'y] = (X'X)^{-1}X'V[y]X(X'X)^{-1} = (X'X)^{-1}X'(\sigma^2 I_n)X(X'X)^{-1} = \sigma^2 (X'X)^{-1}$

Fitted Values are Random Variables

Similarly, ŷ = Xb̂ is a linear function of y, so we can derive the following.

The expectation of ŷ is given by

$E(\hat{y}) = E[X(X'X)^{-1}X'y] = E[X(X'X)^{-1}X'(Xb + e)] = E[Xb] + X(X'X)^{-1}X'E[e] = Xb$

and the covariance matrix is given by

$V(\hat{y}) = V[X(X'X)^{-1}X'y] = X(X'X)^{-1}X'V[y]X(X'X)^{-1}X' = X(X'X)^{-1}X'(\sigma^2 I_n)X(X'X)^{-1}X' = \sigma^2 X(X'X)^{-1}X'$

Need for the Bootstrap

If the residuals are Gaussian, $e_i \stackrel{iid}{\sim} N(0, \sigma^2)$, then we have
- $\hat{b} \sim N(b, \sigma^2 (X'X)^{-1})$
- $\hat{y} \sim N(Xb, \sigma^2 X(X'X)^{-1}X')$
so it is possible to make probabilistic statements about b̂ and ŷ.

If $e_i \stackrel{iid}{\sim} F$ for some arbitrary distribution F with $E_F(e_i) = 0$, we can use the bootstrap to make inferences about b̂ and ŷ: use the bootstrap with the ecdf F̂ as the distribution of the ei.

Bootstrapping Regression Residuals

We can use the following bootstrap procedure (a minimal Monte Carlo sketch appears at the end of this slide):
1. Fit the regression model to obtain ŷ and ê = y − ŷ
2. Sample $e_i^*$ with replacement from $\{\hat{e}_1, \ldots, \hat{e}_n\}$ for i ∈ {1, ..., n}
3. Define $y_i^* = \hat{y}_i + e_i^*$ and $\hat{b}^* = (X'X)^{-1}X'y^*$
4. Repeat steps 2–3 a total of B times to get the bootstrap distribution of b̂

We don't need Monte Carlo simulation to get bootstrap standard errors:

$V(\hat{b}^*) = (X'X)^{-1}X'V(y^*)X(X'X)^{-1} = \hat\sigma_F^2 (X'X)^{-1}$

given that $V(y^*) = \hat\sigma_F^2 I_n$, where $\hat\sigma_F^2 = \frac{1}{n}\sum_{i=1}^n \hat{e}_i^2$.
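As a minimal Monte Carlo sketch of steps 1–4 above (illustrative data; the session on the next slide instead uses the bootsamp and bootse functions):

# bootstrapping residuals with an explicit loop (illustrative sketch)
set.seed(1)
x = rexp(100)
y = 3 + 2*x + runif(100, min = -2, max = 2)
fit = lm(y ~ x)
B = 2000
bcoef = replicate(B, {
  ystar = fitted(fit) + sample(resid(fit), replace = TRUE)   # steps 2-3
  coef(lm(ystar ~ x))
})
apply(bcoef, 1, sd)   # bootstrap standard errors of the two coefficients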

Bootstrapping Regression Residuals in R

> set.seed(1)
> n = 500
> x = rexp(n)
> e = runif(n, min = -2, max = 2)
> y = 3 + 2*x + e
> linmod = lm(y ~ x)
> linmod$coef
(Intercept)           x
   2.898325    2.004220
> yhat = linmod$fitted.values
> bsamp = bootsamp(linmod$residuals)
> bsamp = matrix(yhat, n, ncol(bsamp)) + bsamp
> myfun = function(y, x) lm(y ~ x)$coef
> bse = bootse(bsamp, myfun, x = x)
> bse$cov
             (Intercept)            x
(Intercept)  0.006105788 -0.003438571
x           -0.003438571  0.003605852
> sigsq = mean(linmod$residuals^2)
> solve(crossprod(cbind(1, x))) * sigsq
                          x
   0.006112136 -0.003412774
x -0.003412774  0.003573180
> par(mfcol = c(2, 1))
> hist(bse$theta[1,], main = expression(hat(b)[0]))
> hist(bse$theta[2,], main = expression(hat(b)[1]))

[Figure: Histograms of the bootstrap distributions of b̂0 (top, centered near 2.9) and b̂1 (bottom, centered near 2.0).]

Bootstrapping Pairs Instead of Residuals

We could also use the following bootstrap procedure:
1. Fit the regression model to obtain ŷ and ê = y − ŷ
2. Sample $z_i^* = (x_i^*, y_i^*)$ with replacement from {(x1, y1), ..., (xn, yn)} for i ∈ {1, ..., n}
3. Define $x^* = (x_1^*, \ldots, x_n^*)'$, $X_* = [1_n, x^*]$, $y^* = (y_1^*, \ldots, y_n^*)'$, and $\hat{b}^* = (X_*'X_*)^{-1}X_*'y^*$
4. Repeat steps 2–3 a total of B times to get the bootstrap distribution of b̂

Bootstrapping pairs only assumes (xi , yi ) are iid from some F.

Bootstrapping Regression Pairs in R

> set.seed(1)
> n = 500
> x = rexp(n)
> e = runif(n, min = -2, max = 2)
> y = 3 + 2*x + e
> linmod = lm(y ~ x)
> linmod$coef
(Intercept)           x
   2.898325    2.004220
> z = cbind(y, x)
> bsamp = bootsamp(z)
> myfun = function(z) lm(z[,1] ~ z[,2])$coef
> bse = bootse(bsamp, myfun)
> bse$cov
             (Intercept)       z[, 2]
(Intercept)  0.006376993 -0.003913989
z[, 2]      -0.003913989  0.004308720
> sigsq = mean(linmod$residuals^2)
> solve(crossprod(cbind(1, x))) * sigsq
                          x
   0.006112136 -0.003412774
x -0.003412774  0.003573180
> par(mfcol = c(2, 1))
> hist(bse$theta[1,], main = expression(hat(b)[0]))
> hist(bse$theta[2,], main = expression(hat(b)[1]))

[Figure: Histograms of the bootstrap distributions of b̂0 (top, centered near 2.9) and b̂1 (bottom, centered near 2.0).]

Bootstrapping Regression: Pairs or Residuals?

Bootstrapping pairs requires fewer assumptions about the data

- Bootstrapping pairs only assumes (xi, yi) are iid from some F.
- Bootstrapping residuals assumes $(y_i | x_i) \stackrel{ind}{\sim} (b_0 + b_1 x_i, \sigma^2)$.

Bootstrapping pairs can be dangerous when working with categorical predictors and/or continuous predictors with skewed distributions

Bootstrapping residuals is preferable when the regression model is reasonably specified (because X remains unchanged across bootstrap samples).
