
4 Resampling Methods: The Bootstrap

• Situation: Let $x_1, x_2, \ldots, x_n$ be a simple random sample (SRS) of size $n$ taken from a distribution that is unknown. Let $\theta$ be a parameter of interest associated with this distribution, and let $\hat\theta = S(x_1, x_2, \ldots, x_n)$ be a statistic used to estimate $\theta$.
  - For example, the sample mean $\hat\theta = S(x_1, x_2, \ldots, x_n) = \bar{x}$ is a statistic used to estimate the true mean.
• Goals: (i) provide a standard error estimate $se_B(\hat\theta)$ for $\hat\theta$, (ii) estimate the bias of $\hat\theta$, and (iii) generate a confidence interval for the parameter $\theta$.
• Bootstrap methods are computer-intensive methods of providing these estimates and depend on bootstrap samples.
• An (independent) bootstrap sample is a SRS of size $n$ taken with replacement from the data $x_1, x_2, \ldots, x_n$.
• We denote a bootstrap sample as $x_1^*, x_2^*, \ldots, x_n^*$. It consists of members of the original data set $x_1, x_2, \ldots, x_n$, with some members appearing zero times, some appearing only once, some appearing twice, and so on.
• A bootstrap sample replication of $\hat\theta$, denoted $\hat\theta^*$, is the value of $\hat\theta$ evaluated using the bootstrap sample $x_1^*, x_2^*, \ldots, x_n^*$.
• The bootstrap algorithm requires that a large number ($B$) of bootstrap samples be taken. The bootstrap sample replication $\hat\theta^*$ is then calculated for each of the $B$ bootstrap samples. We will denote the $b$th bootstrap replication as $\hat\theta^*(b)$ for $b = 1, 2, \ldots, B$.
• The notes in this section are based on Efron and Tibshirani (1993) and Manly (2007).

4.1 The Bootstrap Estimate of the Standard Error

• The bootstrap estimate of the standard error of $\hat\theta$ is

$$se_B(\hat\theta) = \sqrt{\frac{\sum_{b=1}^{B} \left[ \hat\theta^*(b) - \hat\theta^*(\cdot) \right]^2}{B - 1}} \tag{7}$$

  where $\hat\theta^*(\cdot) = \frac{1}{B} \sum_{b=1}^{B} \hat\theta^*(b)$ is the sample mean of the $B$ bootstrap replications.
• Note that $se_B(\hat\theta)$ is just the sample standard deviation of the $B$ bootstrap replications.
• The limit of $se_B(\hat\theta)$ as $B \to \infty$ is the ideal bootstrap estimate of the standard error.
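The resampling steps and equation (7) can be sketched in a few lines of Python. This is a minimal illustration, not code from the notes: the function name `bootstrap_se`, its defaults, and the `seed` argument are assumptions made for the example, and only the standard library is used.

```python
import math
import random
import statistics


def bootstrap_se(data, stat=statistics.mean, B=1000, seed=None):
    """Return (se_B, replications) for the statistic `stat`.

    `bootstrap_se`, `B=1000` and `seed` are illustrative choices,
    not names or values taken from the notes.
    """
    rng = random.Random(seed)
    n = len(data)
    # Each bootstrap sample is a sample of size n drawn with replacement;
    # the statistic evaluated on it is one bootstrap replication.
    reps = [stat(rng.choices(data, k=n)) for _ in range(B)]
    # Mean of the B replications, theta*(.)
    rep_mean = sum(reps) / B
    # Equation (7): sample standard deviation of the replications (B - 1 denominator)
    se = math.sqrt(sum((r - rep_mean) ** 2 for r in reps) / (B - 1))
    return se, reps
```

Passing a different function, for example `bootstrap_se(data, statistics.median)`, would give the corresponding estimate for the sample median. Fixing `seed` only makes a run reproducible; it does not change the method.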
• Under most circumstances, as the sample size $n$ increases, the sampling distribution of $\hat\theta$ becomes more nearly normal. Under this assumption, an approximate t-based bootstrap confidence interval can be generated using $se_B(\hat\theta)$ and a t-distribution:

$$\hat\theta \pm t^* \, se_B(\hat\theta)$$

  where $t^*$ has $n - 1$ degrees of freedom.

4.2 The Bootstrap Estimate of Bias

• The bias of $\hat\theta = S(X_1, X_2, \ldots, X_n)$ as an estimator of $\theta$ is defined to be

$$\mathrm{bias}(\hat\theta) = E_F(\hat\theta) - \theta$$

  with the expectation taken with respect to the distribution $F$.
• The bootstrap estimate of the bias of $\hat\theta$ as an estimate of $\theta$ is calculated by replacing the distribution $F$ with the empirical cumulative distribution function $\hat{F}$. This yields

$$\widehat{\mathrm{bias}}_B(\hat\theta) = \hat\theta^*(\cdot) - \hat\theta \quad \text{where} \quad \hat\theta^*(\cdot) = \frac{1}{B} \sum_{b=1}^{B} \hat\theta^*(b).$$

• Then, the bias-corrected estimate of $\theta$ is

$$\tilde\theta_B = \hat\theta - \widehat{\mathrm{bias}}_B(\hat\theta) = 2\hat\theta - \hat\theta^*(\cdot).$$

• One problem with estimating the bias is that the variance of $\widehat{\mathrm{bias}}_B(\hat\theta)$ is often large. Efron and Tibshirani (1993) recommend using:
  1. $\hat\theta$ if $\widehat{\mathrm{bias}}_B(\hat\theta)$ is small relative to $se_B(\hat\theta)$.
  2. $\tilde\theta_B$ if $\widehat{\mathrm{bias}}_B(\hat\theta)$ is large relative to $se_B(\hat\theta)$.
  They suggest and justify that if $|\widehat{\mathrm{bias}}_B(\hat\theta)| < 0.25 \, se_B(\hat\theta)$, then the bias can be ignored (unless the goal is precise confidence interval estimation using this standard error).
• Manly (2007) suggests that when using bias correction, it is better to center the confidence interval limits at $\tilde\theta_B$. This would yield the approximate bias-corrected t-based confidence interval:

$$\tilde\theta_B \pm t^* \, se_B(\hat\theta)$$

4.3 Introductory Bootstrap Example

• Consider a SRS with $n = 10$ having y-values

    0  1  2  3  4  8  8  9  10  11

• The following output is based on $B = 40$ bootstrap replications of the sample mean $\bar{x}$, the sample standard deviation $s$, the sample variance $s^2$, and the sample median.
• The terms in the output are equivalent to the following:

    theta(hat) = $\hat\theta$ = the sample estimate of a parameter
    mean       = the sample mean $\bar{x}$
    s          = the sample standard deviation $s$
    variance   = the sample variance $s^2$
    median     = the sample median $m$

• These statistics are used as estimates of the population parameters $\mu$, $\sigma$, $\sigma^2$, and $M$.

theta(hat) values for mean, standard deviation, variance, median

      mean         s  variance    median
    5.6000    4.0332   16.2667    6.0000

The number of bootstrap samples B = 40

The bootstrap samples

     8   3   2   0   2  11  11  10   8   2
     2   9   9   1   9   1  10   4  11   4
     8   9   3   8   2   8   3   9   2   2
    10   2  11   9   4   8   2   8   3   9
     2   8   3   9   8   2   1   8   9   9
    10  10   2  11   4   8   8   3   2   3
    10   8   0   8   9  10   9   1   1  10
     3  10   8   3   2   3  11   9   3  10
    10   1  10   8   1   9  10   0   3   8
     3   3   3   4   3   0   1  11  11   4
     2   3   8   2  10   1   4   8   8   2
     2  10   1   3   2   0  10   3   8   4
     8   2  11  11   1   9  11   3   8   4
     9  11  10   2   8   1   9  10   9   9
     8   4   8   4   0   1   3   1   9  11
     4   9   8   0  10  10   4   1   8   8
     9  11   8  10   3   0   8  10   2   4
     8  11   0   3   2  10   1   4   8   8
     8   8   3   4   8   0   3  10  11   2
     2   1   8   0   2   3  11  11   9   8
     2  11   4   2   8   8   8   0   1  10
     4   2   0   4  11   0   0   9   1  10
    10   2   4   1   4   3   3  10   3   2
     1   8   1   8   4   8   3   0   9  10
     8   3   9   1   1   8  10   3  10   0
     8  11   8   2   1   2   0  11   3  11
     0   8   3   8  10  10   8   1   8   9
     2   0   3   3   9  11   2  11   3  11
    11   1   0  10   9   9   8   2  11   1
     2   8  10   1  11   4   8   4   3   2
    10   9   1   0   3   2   2   0   2  10
     2   8  11   3   0   4   4   4   3   3
     3   0   8   2   4   8   1   3  10   4
    10   9   8  10   8  10   0   0  11   8
     8  11   1   3   3  11   3   0  10   8
     8   3   1   4   8   3   8   9   1   1
     0   9  11   2   4  10   2   8  10   4
     8  11   0   8   0   8   1   3   0   2
     9   9   8  11  10   2  11  11   9   1
     1  11  11  11   3   0   0   3   0   9

Next, we calculate the sample mean, sample standard deviation, sample variance, and sample median for each of the bootstrap samples. These represent the $\hat\theta^*_b$ values.
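As a check on this step, the four statistics can be computed for a single bootstrap sample with Python's standard library. This is a small sketch rather than the software that produced the output: the helper name `replication` is an assumption, and the two lists are the original data and the first bootstrap sample listed above.

```python
import statistics


def replication(sample):
    """Return (mean, std dev, variance, median) for one sample.

    `replication` is an illustrative helper name, not one from the notes.
    """
    return (
        statistics.mean(sample),
        statistics.stdev(sample),     # sample standard deviation (n - 1 denominator)
        statistics.variance(sample),  # sample variance (n - 1 denominator)
        statistics.median(sample),
    )


original = [0, 1, 2, 3, 4, 8, 8, 9, 10, 11]       # the SRS of size 10
first_sample = [8, 3, 2, 0, 2, 11, 11, 10, 8, 2]  # first bootstrap sample

for label, values in (("original", original), ("first bootstrap sample", first_sample)):
    mean, s, var, med = replication(values)
    print(f"{label}: mean={mean:.4f} s={s:.4f} variance={var:.4f} median={med:.4f}")
```

Applying `replication` to each of the 40 bootstrap samples in turn gives one row of bootstrap replications per sample.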
The column labels below represent bootstrap labels for four different $\hat\theta^*_b$ cases:

    mean     = a bootstrap sample mean $\bar{x}^*_b$
    std dev  = a bootstrap sample standard deviation $s^*_b$
    variance = a bootstrap sample variance $s^{2*}_b$
    median   = a bootstrap sample median $m^*_b$

Bootstrap replications: theta(hat)^*_b

      mean   std dev  variance    median
    5.7000    4.2960   18.4556    5.5000
    6.0000    3.9721   15.7778    6.5000
    5.4000    3.2042   10.2667    5.5000
    6.6000    3.4705   12.0444    8.0000
    5.9000    3.4140   11.6556    8.0000
    6.1000    3.6347   13.2111    6.0000
    6.6000    4.1687   17.3778    8.5000
    6.2000    3.6757   13.5111    5.5000
    6.0000    4.2164   17.7778    8.0000
    4.3000    3.7431   14.0111    3.0000
    4.8000    3.3267   11.0667    3.5000
    4.3000    3.6833   13.5667    3.0000
    6.8000    3.9384   15.5111    8.0000
    7.8000    3.4254   11.7333    9.0000
    4.9000    3.8427   14.7667    4.0000
    6.2000    3.6757   13.5111    8.0000
    6.5000    3.8944   15.1667    8.0000
    5.5000    3.9511   15.6111    6.0000
    5.7000    3.7431   14.0111    6.0000
    5.5000    4.3012   18.5000    5.5000
    5.4000    4.0332   16.2667    6.0000
    4.1000    4.3576   18.9889    3.0000
    4.2000    3.1903   10.1778    3.0000
    5.2000    3.7947   14.4000    6.0000
    5.3000    4.0565   16.4556    5.5000
    5.7000    4.5228   20.4556    5.5000
    6.5000    3.7193   13.8333    8.0000
    5.5000    4.4284   19.6111    3.0000
    6.2000    4.5898   21.0667    8.5000
    5.3000    3.6225   13.1222    4.0000
    3.9000    4.0947   16.7667    2.0000
    4.2000    3.1198    9.7333    3.5000
    4.3000    3.3015   10.9000    3.5000
    7.4000    4.0332   16.2667    8.5000
    5.8000    4.2374   17.9556    5.5000
    4.6000    3.3066   10.9333    3.5000
    6.0000    4.0277   16.2222    6.0000
    4.1000    4.2019   17.6556    2.5000
    8.1000    3.6347   13.2111    9.0000
    4.9000    4.9766   24.7667    3.0000

Take the mean of each column ($\hat\theta^*(\cdot)$), yielding $\bar{x}^*(\cdot)$, $s^*(\cdot)$, $s^{2*}(\cdot)$, and $m^*(\cdot)$.

Mean of the B bootstrap replications: theta(hat)^*_(.)

      mean         s  variance    median
    5.5875    3.8707   15.1581    5.6250

Take the standard deviation of each column ($se_B(\hat\theta)$), yielding $se_B(\bar{x}^*)$, $se_B(s^*)$, $se_B(s^{2*})$, and $se_B(m^*)$.

Bootstrap standard error: s.e._B(theta(hat))

      mean         s  variance    median
    1.0229    0.4248    3.3422    2.1266

Finally, calculate the estimates of bias $\widehat{\mathrm{bias}}_B(\hat\theta) = \hat\theta^*(\cdot) - \hat\theta$ for the four cases.
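The bias step is simple arithmetic on the summaries already reported, so it can be sketched directly. The snippet below is an illustration only: the dictionary names and the `results` layout are assumptions, while the numbers are the theta(hat) values, column means, and bootstrap standard errors quoted in the output above. It also applies the 0.25 rule of thumb from Section 4.2.

```python
# Values copied from the output above; the variable names are illustrative.
theta_hat = {"mean": 5.6000, "s": 4.0332, "variance": 16.2667, "median": 6.0000}
theta_star_mean = {"mean": 5.5875, "s": 3.8707, "variance": 15.1581, "median": 5.6250}
se_boot = {"mean": 1.0229, "s": 0.4248, "variance": 3.3422, "median": 2.1266}

results = {}
for name, estimate in theta_hat.items():
    bias = theta_star_mean[name] - estimate  # bootstrap estimate of bias
    corrected = estimate - bias              # equals 2*theta_hat - theta*(.)
    ratio = abs(bias) / se_boot[name]        # size of the bias in standard-error units
    results[name] = {
        "bias": bias,
        "bias_corrected": corrected,
        "ratio": ratio,
        "ignore_bias": ratio < 0.25,         # rule of thumb from Section 4.2
    }

for name, r in results.items():
    print(f"{name:>8}: bias={r['bias']:.4f}  corrected={r['bias_corrected']:.4f}  "
          f"|bias|/se={r['ratio']:.3f}  ignore={r['ignore_bias']}")
```

Because the inputs are rounded to four decimals, the printed biases may differ in the last digit from values computed on the unrounded replications.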
