Bootstrap (Part 3)

Christof Seiler

Stanford University, Spring 2016, Stats 205

Overview

- So far we have used three different bootstraps:

- Nonparametric bootstrap on the rows (e.g. regression, PCA with random rows and columns)
- Nonparametric bootstrap on the residuals (e.g. regression)
- Parametric bootstrap (e.g. PCA with fixed rows and columns)

- Today, we will look at some tricks to improve the bootstrap for confidence intervals:

- Studentized bootstrap

Introduction

- A statistic is (asymptotically) pivotal if its limiting distribution does not depend on unknown quantities
- For example, with observations X1, ..., Xn from a normal distribution with unknown mean and variance, a pivotal quantity is

  T(X1, ..., Xn) = √n (θ̂ − θ) / σ̂

  with unbiased estimates for sample mean and variance

θ̂ = (1/n) ∑_{i=1}^n X_i    σ̂² = (1/(n−1)) ∑_{i=1}^n (X_i − θ̂)²

- Then T(X1, ..., Xn) is a pivot following the Student's t-distribution with ν = n − 1 degrees of freedom
- This is because the distribution of T(X1, ..., Xn) depends on neither µ nor σ²
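As a small numerical illustration (not from the slides), one draw of the pivot T can be computed in Python; the sample size and true mean below are arbitrary choices:

```python
import math
import random
import statistics

# Hypothetical example: one draw of the pivot T = sqrt(n) * (theta_hat - theta) / sigma_hat
# for a normal sample with known true mean theta = 5 (all constants are our own choices).
random.seed(1)
n, theta = 30, 5.0
x = [random.gauss(theta, 2.0) for _ in range(n)]

theta_hat = statistics.fmean(x)   # sample mean, the unbiased estimate of theta
sigma_hat = statistics.stdev(x)   # square root of the unbiased variance estimate
T = math.sqrt(n) * (theta_hat - theta) / sigma_hat
# T is one draw from Student's t-distribution with n - 1 = 29 degrees of freedom
print(T)
```

Repeating this many times and histogramming T would trace out the t-distribution, regardless of the unknown µ and σ².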

- The bootstrap is better at estimating the distribution of a pivotal statistic than that of a nonpivotal statistic
- We will see an asymptotic argument using Edgeworth expansions
- But first, let us look at an example

Motivation

- Take n = 20 random exponential variables with mean 3

n = 20
x = rexp(n, rate = 1/3)

- Generate B = 1000 bootstrap samples of x, and calculate the mean for each bootstrap sample

B = 1000
s = numeric(B)
for (j in 1:B) {
  boot = sample(n, replace = TRUE)
  s[j] = mean(x[boot])
}

- Form a confidence interval from the bootstrap samples using quantiles (α = .025)

simple.ci = quantile(s,c(.025,.975))

- Repeat this process 100 times
- Check how often the intervals actually contain the true mean
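A self-contained Python sketch of this coverage experiment (standard library only; `random.expovariate(1/3)` plays the role of `rexp(n, rate = 1/3)`, and the function name `percentile_ci` is our own):

```python
import random
import statistics

# Sketch of the coverage experiment: percentile bootstrap intervals for the
# mean of n = 20 exponential variables with true mean 3, repeated 100 times.
random.seed(0)
n, B, true_mean = 20, 1000, 3.0

def percentile_ci(x, B, alpha=0.025):
    # Percentile interval: resample with replacement, take empirical quantiles.
    means = []
    for _ in range(B):
        boot = [random.choice(x) for _ in range(len(x))]
        means.append(statistics.fmean(boot))
    means.sort()
    return means[int(alpha * B)], means[int((1 - alpha) * B) - 1]

covered = 0
for _ in range(100):
    x = [random.expovariate(1 / true_mean) for _ in range(n)]
    lo, hi = percentile_ci(x, B)
    covered += lo <= true_mean <= hi
print(covered)  # count of intervals (out of 100) containing the true mean
```

For skewed data at n = 20, the count tends to fall short of the nominal 95, which is what the figure below illustrates.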

[Figure: the 100 simple bootstrap confidence intervals]

- Another way is to bootstrap a pivotal quantity instead of the raw statistic
- Calculate the sample mean and standard deviation

x = rexp(n, rate = 1/3)
mean.x = mean(x)
sd.x = sd(x)

- For each bootstrap sample, calculate

z = numeric(B)
for (j in 1:B) {
  boot = sample(n, replace = TRUE)
  z[j] = (mean.x - mean(x[boot])) / sd(x[boot])
}

- Form a confidence interval like this

pivot.ci = mean.x + sd.x * quantile(z, c(.025, .975))
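The same studentized pivot can be sketched in Python (standard library only; variable names are our own, and the √n factors cancel between the pivot and the interval, as in the R code):

```python
import random
import statistics

# Sketch of the studentized pivot z*b = (mean(x) - mean(x*b)) / sd(x*b),
# mirroring the R snippet: pivot.ci = mean.x + sd.x * quantile(z, c(.025, .975))
random.seed(3)
n, B = 20, 1000
x = [random.expovariate(1 / 3) for _ in range(n)]
mean_x = statistics.fmean(x)
sd_x = statistics.stdev(x)

z = []
for _ in range(B):
    boot = [random.choice(x) for _ in range(n)]
    z.append((mean_x - statistics.fmean(boot)) / statistics.stdev(boot))
z.sort()

lo = mean_x + sd_x * z[int(0.025 * B)]       # lower 2.5% quantile of the pivot
hi = mean_x + sd_x * z[int(0.975 * B) - 1]   # upper 97.5% quantile of the pivot
print(lo, hi)  # pivot-based 95% interval
```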

[Figure: the 100 pivot-based bootstrap confidence intervals]

Studentized Bootstrap

- Consider X1, ..., Xn from F
- Let θ̂ be an estimate of some parameter θ
- Let σ̂ be a standard error for θ̂ estimated using the bootstrap
- Most of the time, as n grows,

(θ̂ − θ) / σ̂ ≈ N(0, 1)

- Let z^(α) be the 100·α-th percentile of N(0, 1)
- Then a standard confidence interval with coverage probability 1 − 2α is

  θ̂ ± z^(1−α) · σ̂
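For example, with hypothetical values θ̂ = 3.1 and σ̂ = 0.4 (numbers of our own, not from the slides), Python's `statistics.NormalDist` gives the standard interval:

```python
from statistics import NormalDist

# Hypothetical numbers: theta_hat = 3.1, bootstrap standard error 0.4, alpha = 0.025
theta_hat, se, alpha = 3.1, 0.4, 0.025
z = NormalDist().inv_cdf(1 - alpha)  # z^(1-alpha), about 1.96 for alpha = 0.025
lo, hi = theta_hat - z * se, theta_hat + z * se
print(round(lo, 3), round(hi, 3))  # -> 2.316 3.884
```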

- As n → ∞, the bootstrap and standard intervals converge

- How can we improve the standard confidence interval?
- These intervals are valid under the assumption that

Z = (θ̂ − θ) / σ̂ ≈ N(0, 1)

- But this holds only as n → ∞
- and is only approximate for finite n
- When θ̂ is the sample mean, a better approximation is

Z = (θ̂ − θ) / σ̂ ≈ t_{n−1}

where t_{n−1} is the Student's t distribution with n − 1 degrees of freedom

- With this new approximation, we have

θ̂ ± t_{n−1}^(1−α) · σ̂

- As n grows, the t distribution converges to the normal distribution
- Intuitively, it widens the interval to account for the unknown standard error
- But, for instance, it does not account for skewness in the underlying population
- This can happen when θ̂ is not the sample mean
- The studentized bootstrap can adjust for such errors

- We estimate the distribution of

Z = (θ̂ − θ) / σ̂ ≈ ?

- by generating B bootstrap samples X*1, X*2, ..., X*B
- and computing

  Z*b = (θ̂*b − θ̂) / σ̂*b

- Then the α-th percentile of Z*b is estimated by the value t̂^(α) such that

  #{Z*b ≤ t̂^(α)} / B = α

- This yields the studentized bootstrap interval

(θ̂ − t̂^(1−α) · σ̂, θ̂ − t̂^(α) · σ̂)

Asymptotic Argument in Favor of Pivoting

- Consider θ estimated by θ̂ with variance σ²/n
- Take the pivotal statistic

  S = √n (θ̂ − θ) / σ̂

with estimate θ̂ and asymptotic variance estimate σ̂²
- Then, we can use the Edgeworth expansion

  P(S ≤ x) = Φ(x) + n^(−1/2) q(x) φ(x) + O(n^(−1))

where Φ is the standard normal distribution function, φ the standard normal density, and q an even polynomial of degree 2

- The bootstrap version is

  S* = √n (θ̂* − θ̂) / σ̂*

- Then, we can use the Edgeworth expansion

P(S* ≤ x | X1, ..., Xn) = Φ(x) + n^(−1/2) q̂(x) φ(x) + O(n^(−1))

- q̂ is obtained by replacing unknowns in q with bootstrap estimates
- Asymptotically, we further have

  q̂ − q = O(n^(−1/2))

- Then, the error of the bootstrap approximation to the distribution of S is

P(S ≤ x) − P(S* ≤ x | X1, ..., Xn)
  = (Φ(x) + n^(−1/2) q(x) φ(x) + O(n^(−1))) − (Φ(x) + n^(−1/2) q̂(x) φ(x) + O(n^(−1)))

  = n^(−1/2) (q(x) − q̂(x)) φ(x) + O(n^(−1)) = O(n^(−1))

- Compare this to the normal approximation, whose error is of order n^(−1/2)
- This is the same as the error of the standard (nonpivotal) bootstrap (can be shown with the same argument)

- These pivotal intervals are more accurate in large samples than standard intervals and t intervals
- Accuracy comes at the cost of generality

- standard normal tables apply to all samples and all sample sizes
- t tables apply to all samples of fixed n
- studentized bootstrap tables apply only to the given sample

- The studentized bootstrap interval can be asymmetric
- It can be used for simple statistics, like the mean, median, trimmed mean, and sample percentiles
- But for more general statistics, like the correlation coefficient, there are some problems:

- The interval can fall outside of the allowable range
- Computational issues arise if both the parameter and its standard error have to be bootstrapped

- The studentized bootstrap works better for variance-stabilized statistics
- Consider a statistic X with mean θ and standard deviation s(θ) that varies as a function of θ
- Using the delta method and solving an ordinary differential equation, we can show that

  g(x) = ∫^x 1/s(u) du

will make the variance of g(X) approximately constant
- Usually s(u) is unknown
- So we need to estimate s(u) = se(θ̂ | θ = u) using the bootstrap
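A classical worked example (our addition, not from the slides): for a Poisson-type statistic with s(θ) = √θ, the integral gives the familiar square-root transform:

```latex
g(x) = \int^{x} \frac{du}{s(u)} = \int^{x} \frac{du}{\sqrt{u}} = 2\sqrt{x},
\qquad
\operatorname{Var}\, g(X) \approx g'(\theta)^2 \, s(\theta)^2
  = \frac{1}{\theta} \cdot \theta = 1,
```

so g(X) = 2√X has approximately constant variance by the delta method.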

1. First bootstrap θ̂, second bootstrap ŝe(θ̂) from θ̂*
2. Fit a curve through the points (θ̂*1, ŝe(θ̂*1)), ..., (θ̂*B, ŝe(θ̂*B))
3. Obtain the variance stabilization g(θ̂) by numerical integration
4. Studentized bootstrap using g(θ̂*) − g(θ̂) (no denominator, since the variance is now approximately one)
5. Map back through the inverse transformation g^(−1)
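Steps 1-3 can be sketched with a double bootstrap in Python (standard library only; the crude step-function "curve fit" and the trapezoidal integration below are simplifications of our own):

```python
import random
import statistics

# Hypothetical sketch of steps 1-3: tabulate a variance-stabilizing transform g
# for the sample mean of exponential data via a double bootstrap.
random.seed(2)
x = [random.expovariate(1 / 3) for _ in range(20)]
n, B1, B2 = len(x), 200, 50

# Step 1: bootstrap theta-hat; for each replicate, bootstrap its standard error
pairs = []
for _ in range(B1):
    xb = [random.choice(x) for _ in range(n)]
    tb = statistics.fmean(xb)
    inner = [statistics.fmean([random.choice(xb) for _ in range(n)])
             for _ in range(B2)]
    pairs.append((tb, statistics.stdev(inner)))
pairs.sort()

# Step 2: "fit a curve" -- here just the sorted (theta*, se*) pairs themselves
# Step 3: integrate 1/s(u) by the trapezoidal rule to tabulate g
thetas = [p[0] for p in pairs]
ses = [p[1] for p in pairs]
g = [0.0]
for i in range(1, B1):
    du = thetas[i] - thetas[i - 1]
    g.append(g[-1] + du * 0.5 * (1 / ses[i] + 1 / ses[i - 1]))
# g is now a monotone tabulation of the transform; steps 4-5 (studentized
# bootstrap on the g scale, then mapping back through g^(-1)) are omitted
```

In practice one would fit a smooth curve in step 2 (Efron and Tibshirani use a smoother) rather than the raw pairs used here.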

Source: Efron and Tibshirani (1994)

Studentized Bootstrap in R

library(boot)
mean.fun = function(d, i) {
  m = mean(d$hours[i])
  n = length(i)
  v = (n - 1) * var(d$hours[i]) / n^2
  c(m, v)
}
air.boot <- boot(aircondit, mean.fun, R = 999)
results = boot.ci(air.boot, type = c("basic", "stud"))

results

## BOOTSTRAP CALCULATIONS
## Based on 999 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = air.boot, type = c("basic", "stud"))
##
## Intervals :
## Level      Basic             Studentized
## 95%   ( 22.2, 171.2 )   (  49.0, 303.0 )
## Calculations and Intervals on Original Scale

References

- Efron (1987). Better Bootstrap Confidence Intervals
- Hall (1992). The Bootstrap and Edgeworth Expansion
- Efron and Tibshirani (1994). An Introduction to the Bootstrap
- Love (2010). Bootstrap-t Confidence Intervals (blog post)