Bootstrap (Part 3)
Christof Seiler, Stanford University, Spring 2016, Stats 205

Overview

- So far we used three different bootstraps:
  - Nonparametric bootstrap on the rows (e.g. regression, PCA with random rows and columns)
  - Nonparametric bootstrap on the residuals (e.g. regression)
  - Parametric bootstrap (e.g. PCA with fixed rows and columns)
- Today, we will look at some tricks to improve the bootstrap for confidence intervals:
  - Studentized bootstrap

Introduction

- A statistic is (asymptotically) pivotal if its limiting distribution does not depend on unknown quantities
- For example, with observations $X_1, \dots, X_n$ from a normal distribution with unknown mean and variance, a pivotal quantity is
  $$T(X_1, \dots, X_n) = \sqrt{n} \left( \frac{\hat\theta - \theta}{\hat\sigma} \right)$$
  with unbiased estimates for the sample mean and variance
  $$\hat\theta = \frac{1}{n} \sum_{i=1}^n X_i \qquad \hat\sigma^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \hat\theta)^2$$
- Then $T(X_1, \dots, X_n)$ is a pivot following the Student's t-distribution with $\nu = n - 1$ degrees of freedom,
- because the distribution of $T(X_1, \dots, X_n)$ does not depend on $\mu$ or $\sigma^2$

Introduction

- The bootstrap is better at estimating the distribution of a pivotal statistic than that of a nonpivotal statistic
- We will see an asymptotic argument using Edgeworth expansions
- But first, let us look at an example

Motivation

- Take $n = 20$ random exponential variables with mean 3:

```r
x = rexp(n, rate = 1/3)
```

- Generate $B = 1000$ bootstrap samples of `x`, and calculate the mean of each bootstrap sample:

```r
s = numeric(B)
for (j in 1:B) {
  boot = sample(n, replace = TRUE)
  s[j] = mean(x[boot])
}
```

- Form a confidence interval from the bootstrap samples using quantiles ($\alpha = .025$):

```r
simple.ci = quantile(s, c(.025, .975))
```

- Repeat this process 100 times
- Check how often the intervals actually contain the true mean

[Figure: bootstrap conf intervals — 100 simple quantile intervals plotted against the true mean 3]

Motivation

- Another way is to calculate a pivotal quantity as the bootstrapped statistic
- Calculate the mean and standard deviation:

```r
x = rexp(n, rate = 1/3)
mean.x = mean(x)
sd.x = sd(x)
```

- For each bootstrap sample, calculate

```r
z = numeric(B)
for (j in 1:B) {
  boot = sample(n, replace = TRUE)
  z[j] = (mean.x - mean(x[boot])) / sd(x[boot])
}
```

- Form a confidence interval like this:

```r
pivot.ci = mean.x + sd.x * quantile(z, c(.025, .975))
```

[Figure: bootstrap conf intervals — 100 pivot-based intervals plotted against the true mean 3]

The full coverage experiment behind the two figures is sketched in the code below.
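To make the comparison concrete, here is a minimal, self-contained coverage simulation that wraps the two interval constructions above. It follows the slides' setup ($n = 20$ exponentials with mean 3, $B = 1000$ bootstrap samples, 100 repetitions); the wrapper name `coverage.sim` and the seed are my own additions, not from the lecture.

```r
# Coverage simulation for the two intervals above; for each repetition the
# same bootstrap indices are reused for both the simple and the pivot interval.
coverage.sim = function(n = 20, B = 1000, reps = 100, true.mean = 3) {
  hits = c(simple = 0, pivot = 0)
  for (r in 1:reps) {
    x = rexp(n, rate = 1 / true.mean)
    mean.x = mean(x)
    sd.x = sd(x)
    s = numeric(B)
    z = numeric(B)
    for (j in 1:B) {
      boot = sample(n, replace = TRUE)
      s[j] = mean(x[boot])
      z[j] = (mean.x - mean(x[boot])) / sd(x[boot])
    }
    simple.ci = quantile(s, c(.025, .975))
    pivot.ci = mean.x + sd.x * quantile(z, c(.025, .975))
    hits["simple"] = hits["simple"] +
      (simple.ci[1] <= true.mean && true.mean <= simple.ci[2])
    hits["pivot"] = hits["pivot"] +
      (pivot.ci[1] <= true.mean && true.mean <= pivot.ci[2])
  }
  hits / reps  # empirical coverage of the two interval types
}
set.seed(205)
coverage.sim()
```

On a typical run the simple quantile interval should cover the true mean noticeably less often than the nominal 95%, while the pivot interval should come closer, matching the two figures above.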
Studentized Bootstrap

- Consider $X_1, \dots, X_n$ from $F$
- Let $\hat\theta$ be an estimate of some parameter $\theta$
- Let $\hat\sigma$ be a standard error for $\hat\theta$, estimated using the bootstrap
- Most of the time, as $n$ grows,
  $$\frac{\hat\theta - \theta}{\hat\sigma} \mathrel{\dot\sim} N(0, 1)$$
  (read $\dot\sim$ as "is approximately distributed as")
- Let $z^{(\alpha)}$ be the $100 \cdot \alpha$th percentile of $N(0, 1)$
- Then a standard confidence interval with coverage probability $1 - 2\alpha$ is
  $$\hat\theta \pm z^{(1-\alpha)} \cdot \hat\sigma$$
- As $n \to \infty$, the bootstrap and standard intervals converge

Studentized Bootstrap

- How can we improve the standard confidence interval?
- These intervals are valid under the assumption that
  $$Z = \frac{\hat\theta - \theta}{\hat\sigma} \mathrel{\dot\sim} N(0, 1)$$
- But this only holds exactly as $n \to \infty$; it is approximate for finite $n$
- When $\hat\theta$ is the sample mean, a better approximation is
  $$Z = \frac{\hat\theta - \theta}{\hat\sigma} \mathrel{\dot\sim} t_{n-1}$$
  where $t_{n-1}$ is the Student's t distribution with $n - 1$ degrees of freedom

Studentized Bootstrap

- With this new approximation, we have
  $$\hat\theta \pm t_{n-1}^{(1-\alpha)} \cdot \hat\sigma$$
- As $n$ grows, the t distribution converges to the normal distribution
- Intuitively, it widens the interval to account for the unknown standard error
- But, for instance, it does not account for skewness in the underlying population
- This matters when $\hat\theta$ is not the sample mean
- The studentized bootstrap can adjust for such errors

Studentized Bootstrap

- We estimate the distribution of
  $$Z = \frac{\hat\theta - \theta}{\hat\sigma} \mathrel{\dot\sim} \;?$$
- by generating $B$ bootstrap samples $X^{*1}, X^{*2}, \dots, X^{*B}$
- and computing
  $$Z^{*b} = \frac{\hat\theta^{*b} - \hat\theta}{\hat\sigma^{*b}}$$
- Then the $\alpha$th percentile of $Z^{*b}$ is estimated by the value $\hat t^{(\alpha)}$ such that
  $$\frac{\#\{Z^{*b} \le \hat t^{(\alpha)}\}}{B} = \alpha$$
- which yields the studentized bootstrap interval
  $$\left(\hat\theta - \hat t^{(1-\alpha)} \cdot \hat\sigma, \; \hat\theta - \hat t^{(\alpha)} \cdot \hat\sigma\right)$$

Asymptotic Argument in Favor of Pivoting

- Consider a parameter $\theta$ estimated by $\hat\theta$ with variance $\frac{1}{n}\sigma^2$
- Take the pivotal statistic
  $$S = \sqrt{n} \left( \frac{\hat\theta - \theta}{\hat\sigma} \right)$$
  with estimate $\hat\theta$ and asymptotic variance estimate $\hat\sigma^2$
- Then, we can use the Edgeworth expansion
  $$P(S \le x) = \Phi(x) + n^{-1/2} q(x) \phi(x) + O(n^{-1})$$
  with $\Phi$ the standard normal distribution function, $\phi$ the standard normal density, and $q$ an even polynomial of degree 2

Asymptotic Argument in Favor of Pivoting

- The bootstrap estimate is
  $$S^* = \sqrt{n} \left( \frac{\hat\theta^* - \hat\theta}{\hat\sigma^*} \right)$$
- Then, we can use the Edgeworth expansion
  $$P(S^* \le x \mid X_1, \dots, X_n) = \Phi(x) + n^{-1/2} \hat q(x) \phi(x) + O(n^{-1})$$
- $\hat q$ is obtained by replacing the unknowns in $q$ with bootstrap estimates
- Asymptotically, we further have $\hat q - q = O(n^{-1/2})$

Asymptotic Argument in Favor of Pivoting

- Then, the error of the bootstrap approximation to the distribution of $S$ is
  $$P(S \le x) - P(S^* \le x \mid X_1, \dots, X_n) = n^{-1/2} \left( q(x) - \hat q(x) \right) \phi(x) + O(n^{-1}) = O(n^{-1})$$
  since $\hat q - q = O(n^{-1/2})$
- Compare this to the normal approximation, whose error is $O(n^{-1/2})$,
- which is the same as the error when using the standard (nonpivotal) bootstrap (this can be shown with the same argument)

Studentized Bootstrap

- These pivotal intervals are more accurate in large samples than standard intervals and t intervals
- Accuracy comes at the cost of generality:
  - standard normal tables apply to all samples and all sample sizes
  - t tables apply to all samples of a fixed $n$
  - studentized bootstrap tables apply only to the given sample
- The studentized bootstrap interval can be asymmetric
- It can be used for simple statistics, like the mean, median, trimmed mean, and sample percentile
- But for more general statistics, like the correlation coefficient, there are some problems:
  - the interval can fall outside the allowable range
  - computational issues arise if both the parameter and its standard error have to be bootstrapped

Studentized Bootstrap

- The studentized bootstrap works better for variance-stabilized parameters
- Consider a random variable $X$ with mean $\theta$ and standard deviation $s(\theta)$ that varies as a function of $\theta$
- Using the delta method and solving an ordinary differential equation, we can show that
  $$g(x) = \int^x \frac{1}{s(u)} \, du$$
  will make the variance of $g(X)$ approximately constant
- Usually $s(u)$ is unknown
- So we need to estimate $s(u) = \operatorname{se}(\hat\theta \mid \theta = u)$ using the bootstrap

Studentized Bootstrap

The resulting algorithm (see the code sketch after this list):

1. First bootstrap $\hat\theta$; second, bootstrap $\hat{\operatorname{se}}(\hat\theta^*)$ from each $\hat\theta^*$
2. Fit a curve through the points $(\hat\theta^{*1}, \hat{\operatorname{se}}(\hat\theta^{*1})), \dots, (\hat\theta^{*B}, \hat{\operatorname{se}}(\hat\theta^{*B}))$
3. Variance stabilize: compute $g(\hat\theta)$ by numerical integration
4. Run the studentized bootstrap using $g(\hat\theta^*) - g(\hat\theta)$ (no denominator, since the variance is now approximately one)
5. Map back through the inverse transformation $g^{-1}$

Source: Efron and Tibshirani (1994)
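The recipe can be sketched end to end in R. This is a minimal illustration for the sample mean, not code from the lecture: it uses a nested bootstrap for the standard errors (step 1), `smooth.spline` for the curve fit (step 2), trapezoidal integration for $g$ (step 3), and linear interpolation for $g^{-1}$ (step 5). All variable names (`B2`, `theta.star`, `se.star`, ...) are illustrative.

```r
# Sketch of the variance-stabilized studentized bootstrap for the mean,
# assuming the exponential sample from the motivation section.
set.seed(205)
x = rexp(20, rate = 1/3)
B = 1000; B2 = 50
theta.hat = mean(x)

# Step 1: outer bootstrap of theta.hat, nested bootstrap for its se
theta.star = numeric(B); se.star = numeric(B)
for (b in 1:B) {
  xb = sample(x, replace = TRUE)
  theta.star[b] = mean(xb)
  inner = replicate(B2, mean(sample(xb, replace = TRUE)))
  se.star[b] = sd(inner)
}

# Step 2: fit a smooth curve s(u) through (theta.star, se.star)
fit = smooth.spline(theta.star, se.star)
s.fun = function(u) pmax(predict(fit, u)$y, 1e-8)  # keep s(u) positive

# Step 3: g(x) = integral of 1/s(u) du, via the trapezoidal rule on a grid
grid = seq(min(theta.star), max(theta.star), length.out = 512)
integrand = 1 / s.fun(grid)
g.grid = cumsum(c(0, diff(grid) * (head(integrand, -1) + tail(integrand, -1)) / 2))
g = approxfun(grid, g.grid, rule = 2)      # g, clamped outside the grid
g.inv = approxfun(g.grid, grid, rule = 2)  # step 5: inverse by interpolation

# Step 4: studentized bootstrap on the stabilized scale; the variance is
# approximately one there, so no denominator is needed
z = g(theta.star) - g(theta.hat)
ci = g.inv(g(theta.hat) - quantile(z, c(.975, .025)))
ci  # interval mapped back to the original scale through g^{-1}
```

Because $g$ is only known on the bootstrap range, endpoints outside the grid are clamped (`rule = 2`); a more careful implementation would extend the integration range.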
Studentized Bootstrap in R

```r
library(boot)

# Statistic for boot(): returns the estimate and its estimated variance,
# as required for studentized (type = "stud") intervals
mean.fun = function(d, i) {
  m = mean(d$hours[i])
  n = length(i)
  v = (n - 1) * var(d$hours[i]) / n^2
  c(m, v)
}

air.boot <- boot(aircondit, mean.fun, R = 999)
results = boot.ci(air.boot, type = c("basic", "stud"))
```

Studentized Bootstrap in R

```r
results
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 999 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = air.boot, type = c("basic", "stud"))
##
## Intervals :
## Level      Basic              Studentized
## 95%   ( 22.2, 171.2 )   ( 49.0, 303.0 )
## Calculations and Intervals on Original Scale
```

References

- Efron (1987). Better Bootstrap Confidence Intervals.
- Hall (1992). The Bootstrap and Edgeworth Expansion.
- Efron and Tibshirani (1994). An Introduction to the Bootstrap.
- Love (2010). Bootstrap-t Confidence Intervals (link to blog entry).