Regression III Lecture 5: Resampling Techniques

Dave Armstrong
University of Western Ontario
Department of Political Science and (by courtesy) Department of Statistics
e: [email protected]
w: www.quantoid.net/teachicpsr/regression3/

Goals of the Lecture

• Introduce the general idea of resampling - techniques to resample from the original data:
  • Bootstrapping
  • Cross-validation
• An extended example of bootstrapping local polynomial regression models.


Resampling: An Overview

• Resampling techniques sample from the original dataset
• Some of the applications of these methods are:
  • to compute standard errors and confidence intervals (either when we have small sample sizes, dubious distributional assumptions, or for a statistic that does not have an easily derivable asymptotic distribution)
  • Subset selection in regression
  • Handling missing data
  • Selection of degrees of freedom in smoothers (especially GAMs)
• For the most part, this lecture will discuss resampling techniques in the context of computing confidence intervals and hypothesis tests for statistics.

Resampling and Regression: A Caution

• There is no need whatsoever for bootstrapping in regression analysis if the OLS assumptions are met
  • In such cases, OLS estimates are unbiased and maximally efficient.
• There are situations, however, where we cannot satisfy the assumptions and thus other methods are more helpful
  • Robust regression (such as MM-estimation) often provides better estimates than OLS in the presence of influential cases, but only has reliable SEs asymptotically.
  • Local polynomial regression is often "better" (in the RSS sense) in the presence of non-linearity, but because of the unknown df, only has an approximate distribution.
  • Cross-validation is particularly helpful for validating models and choosing model fitting parameters

Bootstrapping: General Overview

• If we assume that a variable X or statistic has a particular population value, we can study how a statistic computed from samples behaves
• We don't always know, however, how a variable or statistic is distributed in the population
• For example, there may be a statistic for which standard errors have not been formulated (e.g., imagine we wanted to test whether two additive scales have significantly different levels of internal consistency - Cronbach's α doesn't have an exact sampling distribution)
• Another example is the impact of missing data on a distribution - we don't know how the missing data differ from the observed data
• Bootstrapping is a technique for estimating standard errors and confidence intervals (sets) without making assumptions about the distributions that give rise to the data

Bootstrapping: General Overview (2)

• Assume that we have a sample of size n for which we require more reliable standard errors for our estimates
• Perhaps n is small, or alternatively, we have a statistic for which there is no known sampling distribution
• The bootstrap provides one "solution":
  • Take several new samples from the original sample, calculating the statistic each time
  • Calculate the average and standard error (and maybe quantiles) from the empirical distribution of the bootstrap samples
• In other words, we find a standard error based on sampling (with replacement) from the original data
• We apply principles of inference similar to those employed when sampling from the population
  • The population is to the sample as the sample is to the bootstrap samples
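The resampling recipe just described can be sketched in a few lines of R. This is an illustrative stand-in: the sample `x` and the statistic (here the standard deviation) are placeholders, not part of the lecture's examples.

```r
## Minimal nonparametric bootstrap sketch: resample with replacement,
## recompute the statistic, then summarize its empirical distribution.
set.seed(42)
x <- rnorm(50)                        # stand-in sample
R <- 1000                             # number of bootstrap samples
stat.star <- replicate(R, sd(sample(x, replace = TRUE)))
mean(stat.star)                       # bootstrap average of the statistic
sd(stat.star)                         # bootstrap standard error
quantile(stat.star, c(.025, .975))    # bootstrap quantiles
```

The same skeleton works for any statistic: only the function applied to the resampled data changes.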


Bootstrapping: General Overview (3)

• There are several variants of the bootstrap. The two we are most interested in are:
1. Nonparametric Bootstrap
  • No underlying population distribution is assumed
  • Most commonly used method
2. Parametric Bootstrap
  • Assumes that the statistic has a particular parametric form (e.g., normal)

Bootstrapping the Mean

• Imagine, unrealistically, that we are interested in finding the confidence interval for the mean of a sample of only 4 observations
• Specifically, assume that we are interested in the difference in income between husbands and wives
• We have four cases, with the following mean differences (in $1000's): 6, -3, 5, 3, for a mean of 2.75 and a standard deviation of 4.031
• From classical theory, we can calculate the standard error:

SE = S_x / √n = 4.031129 / √4 = 2.015564

• Now we'll compare the confidence interval to the one calculated using bootstrapping
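As a quick check, the classical numbers on this slide can be reproduced directly in R:

```r
## The four mean income differences from the slide (in $1000's).
y <- c(6, -3, 5, 3)
mean(y)                    # 2.75
sd(y)                      # 4.031129
sd(y) / sqrt(length(y))    # classical SE of the mean: 2.015564
```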

Defining the Random Variable

• The first thing that bootstrapping does is estimate the population distribution of Y from the four observations in the sample
• In other words, the random variable Y* is defined:

  Y*   p*(Y*)
   6    0.25
  -3    0.25
   5    0.25
   3    0.25

• The mean of Y* is then simply the mean of the sample:

E*(Y*) = Σ Y* p*(Y*) = 2.75 = Ȳ

The Sample as the Population (1)

• We now treat the sample as if it were the population, and resample from it
• In this case, we take all possible samples with replacement, meaning that we take n^n = 256 different samples
• Since we found all possible samples, the mean of these samples is simply the original mean
• We then determine the standard error of Ȳ from these samples:

SE*(Ȳ*) = √( Σ_{b=1}^{n^n} (Ȳ*_b − Ȳ)² / n^n ) = 1.74553

• We now adjust for the sample size:

SE^(Ȳ) = √(n/(n−1)) × SE*(Ȳ*) = √(4/3) × 1.74553 = 2.015564
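Because n is only 4, all n^n = 256 resamples can be enumerated exactly, reproducing the numbers above. A small sketch:

```r
## Enumerate all 4^4 = 256 with-replacement samples of the four cases.
y <- c(6, -3, 5, 3)
n <- length(y)
idx <- as.matrix(expand.grid(rep(list(1:n), n)))   # 256 index vectors
boot.means <- apply(idx, 1, function(i) mean(y[i]))
mean(boot.means)                                   # 2.75: the original mean
SE.star <- sqrt(mean((boot.means - mean(y))^2))    # 1.74553
sqrt(n / (n - 1)) * SE.star                        # 2.015564 after adjustment
```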


The Sample as the Population (2)

• In this example, because we used all possible resamples of our sample, the bootstrap standard error (2.015564) is exactly the same as the original standard error
• This approach can be used for statistics for which we do not have standard error formulas, or we have small sample sizes
• In summary, the following analogies can be made to sampling from the population:
  • Bootstrap observations → original observations
  • Bootstrap mean → original sample mean
  • Original sample mean → unknown population mean µ
  • Distribution of the bootstrap means → unknown sampling distribution from the original sample

Characteristics of the Bootstrap Statistic

• The bootstrap sampling distribution around the original estimate of the statistic T is analogous to the sampling distribution of T around the population parameter
• The average of the bootstrapped statistics is simply:

T̄* = E*(T*) ≈ Σ_{b=1}^{R} T*_b / R

where R is the number of bootstraps
• The bias of T can be seen as its deviation from the bootstrap average (i.e., it estimates the bias of T):

B̂* = T̄* − T

• The estimated bootstrap variance of T* is:

V̂*(T*) = Σ_{b=1}^{R} (T*_b − T̄*)² / (R − 1)

Bootstrapping with Larger Samples

• The larger the sample, the more effort it is to calculate the bootstrap estimates
• With large sample sizes, the possible number of bootstrap samples n^n gets very large and impractical (e.g., it would take a long time to calculate 1000^1000 bootstrap samples)
• Typically we want to take somewhere between 1000 and 2000 bootstrap samples in order to find a confidence interval of a statistic
• After calculating the standard error, we can easily find the confidence interval. Three methods are commonly used:
  1. Normal Theory Intervals
  2. Percentile Intervals
  3. Bias-Corrected Percentile Intervals (BCa)

Evaluating Confidence Intervals

Accuracy: how quickly do coverage errors go to zero?
• Prob{θ < T̂_lo} = α and Prob{θ > T̂_up} = α
• Errors go to zero at a rate of:
  • 1/n (second-order accurate)
  • 1/√n (first-order accurate)

Transformation Respecting:
• For any monotone transformation of θ, φ = m(θ), can we obtain the right confidence interval on φ̂ with the confidence intervals on θ̂ mapped by m? E.g.,

[φ̂_lo, φ̂_up] = [m(θ̂_lo), m(θ̂_up)]
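The bootstrap average, bias, and variance formulas above translate directly into R. The data and statistic below are arbitrary stand-ins:

```r
## Bootstrap bias and variance of a statistic T from R replicates t.star.
set.seed(1)
x <- rexp(40)                  # stand-in sample
T.hat <- median(x)             # original statistic
R <- 2000
t.star <- replicate(R, median(sample(x, replace = TRUE)))
bias.hat <- mean(t.star) - T.hat                        # Bhat* = Tbar* - T
var.hat <- sum((t.star - mean(t.star))^2) / (R - 1)     # Vhat*(T*)
c(bias = bias.hat, variance = var.hat)
```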


Bootstrap Confidence Intervals: Normal Theory Intervals

• Many statistics are asymptotically normally distributed
• Therefore, in large samples, we may be able to use a normality assumption to characterize the bootstrap distribution. E.g.,

T̂* ∼ N(T̂, ŝe²)

where ŝe is √(V̂*(T*))
• This approach works well for the bootstrap confidence interval, but only if the bootstrap sampling distribution is approximately normally distributed
• In other words, it is important to look at the distribution before relying on the normal theory interval

Bootstrap Confidence Intervals: Percentile Intervals

• Uses percentiles of the bootstrap sampling distribution to find the end-points of the confidence interval
• If Ĝ is the CDF of T*, then we can find the 100(1 − 2α)% confidence interval with:

[T̂_{%,lo}, T̂_{%,up}] = [Ĝ⁻¹(α), Ĝ⁻¹(1 − α)]

• The (1 − 2α) percentile interval can be approximated with:

[T̂_{%,lo}, T̂_{%,up}] ≈ [T*_B^(α), T*_B^(1−α)]

where T*_B^(α) and T*_B^(1−α) are the ordered (B) bootstrap replicates such that 100α% of them fall below the former and 100α% of them fall above the latter.
• These intervals do not assume a normal distribution, but they do not perform well unless we have a large original sample and at least 1000 bootstrap samples
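For concreteness, both intervals can be hand-rolled from the same set of replicates (simulated stand-in data, not a lecture example):

```r
## Normal-theory vs. percentile bootstrap intervals, computed by hand.
set.seed(2)
x <- rchisq(30, df = 3)
T.hat <- mean(x)
t.star <- replicate(1999, mean(sample(x, replace = TRUE)))
se.star <- sd(t.star)                                # bootstrap SE
norm.ci <- T.hat + qnorm(c(.025, .975)) * se.star    # normal-theory interval
pct.ci <- quantile(t.star, c(.025, .975))            # percentile interval
rbind(normal = norm.ci, percentile = pct.ci)
```

With a symmetric bootstrap distribution the two intervals will be close; skewness pulls them apart.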

Bootstrap Confidence Intervals: Bias-Corrected, Accelerated (BCa) Percentile Intervals

• The BCa CI adjusts the confidence intervals for bias due to small samples by employing a normalizing transformation through two correction factors.
• This is also a percentile interval, but the percentiles are not necessarily the ones you would think.
• Using strict percentile intervals, [T̂_lo, T̂_up] ≈ [T*_B^(α), T*_B^(1−α)]
• Here, [T̂_lo, T̂_up] ≈ [T*_B^(α₁), T*_B^(α₂)], where α₁ and α₂ are:

α₁ = Φ( ẑ₀ + (ẑ₀ + z^(α)) / (1 − â(ẑ₀ + z^(α))) )
α₂ = Φ( ẑ₀ + (ẑ₀ + z^(1−α)) / (1 − â(ẑ₀ + z^(1−α))) )

Bias correction: ẑ₀

ẑ₀ = Φ⁻¹( #{T*_b ≤ T} / B )

• This just gives the inverse of the normal CDF for the proportion of bootstrap replicates less than T.
• Note that if #{T*_b ≤ T} / B = 0.5, then ẑ₀ = 0.
• If T is unbiased, the proportion will be close to 0.5, meaning that the correction is close to 0.
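The correction factors feed into adjusted percentiles as follows. This sketch uses simulated stand-in data and, for simplicity, sets the acceleration to zero, which gives the plain bias-corrected (BC) interval rather than the full BCa:

```r
## From correction factors to adjusted percentiles (BCa machinery).
set.seed(3)
x <- rlnorm(40)                      # skewed stand-in sample
T.hat <- mean(x)
t.star <- replicate(1999, mean(sample(x, replace = TRUE)))
z0 <- qnorm(mean(t.star <= T.hat))   # bias-correction factor
ahat <- 0                            # acceleration; 0 reduces BCa to BC
z <- qnorm(c(.025, .975))
alpha <- pnorm(z0 + (z0 + z) / (1 - ahat * (z0 + z)))
quantile(t.star, alpha)              # adjusted percentile endpoints
```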

Acceleration: â

â = Σ_{i=1}^{n} (T_(·) − T_(i))³ / ( 6 [ Σ_{i=1}^{n} (T_(·) − T_(i))² ]^{3/2} )

• T_(i) is the calculation of the original estimate T with each observation i jackknifed out in turn.
• T_(·) is Σ_{i=1}^{n} T_(i) / n
• The acceleration constant corrects the bootstrap sample for the rate of change of the standard error of T with respect to the true parameter value

Evaluating Confidence Intervals

                            Normal      Percentile   BCa
  Accuracy                  1st order   1st order    2nd order
  Transformation-respecting No          Yes          Yes

Number of BS Samples

Now, we can revisit the idea about the number of bootstrap samples.
• The real question here is how precise you want estimated quantities to be.
• Often, we're trying to estimate the 95% confidence interval (so the 2.5th and 97.5th percentiles of the distribution).
• Estimating a quantity in the tail of a distribution with precision takes more observations.

Simple Simulation: Bootstrapping the Mean (Lower CI)

[Figure: simulation results for the estimated lower confidence limit]
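A small illustrative simulation (not from the slides) makes the last point concrete: a tail quantile is estimated much more noisily than a central one from the same number of draws, so confidence limits need more replicates than, say, a bootstrap median.

```r
## Monte Carlo SD of an estimated median vs. an estimated 97.5th
## percentile of a standard normal, at two numbers of draws R.
set.seed(4)
mc.sd <- function(R) {
  q <- replicate(200, quantile(rnorm(R), c(.5, .975)))
  apply(q, 1, sd)    # simulation SD of each quantile estimate
}
rbind(R.100 = mc.sd(100), R.2000 = mc.sd(2000))
```

In both rows the 97.5th percentile column is clearly noisier than the median column.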


Bootstrapping the Median

• The median doesn't have an exact sampling distribution (though there is a normal approximation). Let's consider bootstrapping the median to get a sense of its sampling variability.

set.seed(123)
n <- 25
x <- rchisq(n, 1, 1)
meds <- rep(NA, 1000)
for(i in 1:1000){
  meds[i] <- median(x[sample(1:n, n, replace=T)])
}
median(x)

## [1] 1.206853

summary(meds)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.2838  0.6693  1.2069  1.2897  1.7855  3.7398

Make CIs

Normal Theory CI:

se <- sd(meds)
bias <- mean(meds) - median(x)
norm.ci <- median(x) - bias + qnorm((1+.95)/2)*se*c(-1,1)

Percentile CI:

med.ord <- meds[order(meds)]
pct.ci <- med.ord[c(25, 975)]
pct.ci <- quantile(meds, c(.025, .975))

BCa CI:

zhat <- qnorm(mean(meds < median(x)))
medi <- sapply(1:n, function(i)median(x[-i]))
Tdot <- mean(medi)
ahat <- sum((Tdot-medi)^3)/(6*sum((Tdot-medi)^2)^(3/2))
zalpha <- 1.96
z1alpha <- -1.96
a1 <- pnorm(zhat + ((zhat + zalpha)/(1-ahat*(zhat + zalpha))))
a2 <- pnorm(zhat + ((zhat + z1alpha)/(1-ahat*(zhat + z1alpha))))
bca.ci <- quantile(meds, c(a1, a2))
a1 <- floor(a1*1000)
a2 <- ceiling(a2*1000)
bca.ci <- med.ord[c(a2, a1)]

Looking at CIs

mat <- rbind(norm.ci, pct.ci, bca.ci)
rownames(mat) <- c("norm", "pct", "bca")
colnames(mat) <- c("lower", "upper")
round(mat, 3)

##       lower upper
## norm -0.046 2.294
## pct   0.473 2.829
## bca   0.473 2.211

With boot package

library(boot)
set.seed(123)
med.fun <- function(x, inds){
  med <- median(x[inds])
  med
}
boot.med <- boot(x, med.fun, R=1000)
boot.ci(boot.med)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot.med)
##
## Intervals :
## Level      Normal              Basic
## 95%   (-0.107,  2.331 )   (-0.415,  1.940 )
##
## Level     Percentile            BCa
## 95%   ( 0.473,  2.829 )   ( 0.473,  2.211 )
## Calculations and Intervals on Original Scale


BS Difference in Correlations

We can bootstrap the difference in correlations as well.

set.seed(123)
data(Mroz)
cor.fun <- function(dat, inds){
  tmp <- dat[inds, ]
  cor1 <- with(tmp[which(tmp$hc == "no"), ],
               cor(age, lwg, use="complete"))
  cor2 <- with(tmp[which(tmp$hc == "yes"), ],
               cor(age, lwg, use="complete"))
  cor2-cor1
}
boot.cor <- boot(Mroz, cor.fun, R=2000, strata=Mroz$hc)

Bootstrap Confidence Intervals

boot.ci(boot.cor)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot.cor)
##
## Intervals :
## Level      Normal              Basic
## 95%   (-0.0401,  0.2369 )   (-0.0394,  0.2342 )
##
## Level     Percentile            BCa
## 95%   (-0.0351,  0.2385 )   (-0.0350,  0.2386 )
## Calculations and Intervals on Original Scale

Bootstrapping Regression Models

There are two ways to do this
• Random-X Bootstrapping
  • The regressors are treated as random
  • Thus, we select bootstrap samples directly from the observations and calculate the statistic for each bootstrap sample
• Fixed-X Bootstrapping
  • The regressors are treated as fixed - implies that the regression model fit to the data is "correct"
  • The fitted values of Y are then the expectation of the bootstrap samples
  • We attach a random error (usually resampled from the residuals) to each Ŷ, which produces the fixed-X bootstrap sample Y*_b
  • To obtain bootstrap replications of the coefficients, we regress Y*_b on the fixed model matrix for each bootstrap sample

WVS Data

secpay: Imagine two secretaries, of the same age, doing practically the same job. One finds out that the other earns considerably more than she does. The better paid secretary, however, is quicker, more efficient and more reliable at her job. In your opinion, is it fair or not fair that one secretary is paid more than the other?
gini: Observed inequality captured by GINI coefficient at the country level.
democrat: Coded 1 if the country maintained "partly free" or "free" rating on Freedom House's Freedom in the World measure continuously from 1980 to 1995.
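Before turning to the canned routines, the fixed-X scheme can be written out by hand for a toy model. The data here are simulated placeholders, not the WVS example:

```r
## Fixed-X (residual) bootstrap by hand: keep X fixed, resample the
## residuals, rebuild Y, and refit.
set.seed(5)
x1 <- runif(60)
y <- 1 + 2 * x1 + rnorm(60, sd = .5)
m <- lm(y ~ x1)
yhat <- fitted(m)
e <- resid(m)
beta.star <- replicate(500, {
  y.star <- yhat + sample(e, replace = TRUE)   # attach resampled errors
  coef(lm(y.star ~ x1))                        # regress on the fixed X
})
apply(beta.star, 1, sd)                        # bootstrap SEs of coefficients
```

The random-X version would instead resample rows of the data and refit `lm(y ~ x1)` on each resample.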


[Figure: scatterplots of secpay against gini, distinguishing democracies from non-democracies]

Bootstrapping Regression: Example (1)

• We fit a regression of secpay on the interaction of gini and democracy.

dat <- read.csv("http://quantoid.net/files/reg3/weakliem.txt", header=T)
dat <- dat[-c(25,49), ]
dat <- dat[complete.cases(dat[,c("secpay", "gini", "democrat")]), ]
mod <- lm(secpay ~ gini*democrat, data=dat)
summary(mod)

##
## Call:
## lm(formula = secpay ~ gini * democrat, data = dat)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.155604 -0.047674  0.004133  0.046637  0.179825
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)    1.059234   0.059679  17.749  < 2e-16 ***
## gini          -0.004995   0.001516  -3.294  0.00198 **
## democrat      -0.486071   0.088182  -5.512 1.86e-06 ***
## gini:democrat  0.010840   0.002476   4.378 7.53e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07118 on 43 degrees of freedom
## Multiple R-squared:  0.5176, Adjusted R-squared:  0.484
## F-statistic: 15.38 on 3 and 43 DF,  p-value: 6.087e-07

Fixed-X Bootstrapping

• If we are relatively certain that the model is the "right" model (i.e., it is properly specified), then this makes sense.

boot.reg1 <- Boot(mod, R=500, method="residual")  # Boot() from the car package

Fixed-X CIs

perc.ci <- sapply(1:4, function(x){
  boot.ci(boot.reg1, type=c("perc"), index=x)$percent
})[4:5,]
bca.ci <- sapply(1:4, function(x){
  boot.ci(boot.reg1, type=c("bca"), index=x)$bca
})[4:5,]
colnames(bca.ci) <- colnames(perc.ci) <- names(coef(mod))
rownames(bca.ci) <- rownames(perc.ci) <- c("lower", "upper")
perc.ci

##       (Intercept)         gini   democrat gini:democrat
## lower   0.9427854 -0.008099226 -0.6507048   0.005767702
## upper   1.1808116 -0.002214939 -0.3211883   0.015383294

bca.ci

##       (Intercept)         gini   democrat gini:democrat
## lower   0.9398972 -0.008080956 -0.6448495   0.005651383
## upper   1.1769527 -0.002060481 -0.3079673   0.015282963


Random-X Resampling (1)

• Random-X resampling is also referred to as observation resampling
• It selects B bootstrap samples (i.e., resamples) of the observations, fits the regression for each one, and determines the standard errors from the bootstrap distribution
• This is done fairly simply in R using the boot function from the boot library, too.

Random-X Resampling (2)

boot.reg2 <- Boot(mod, R=500, method="case")
perc.ci <- sapply(1:4, function(x){
  boot.ci(boot.reg2, type=c("perc"), index=x)$percent
})[4:5,]
bca.ci <- sapply(1:4, function(x){
  boot.ci(boot.reg2, type=c("bca"), index=x)$bca
})[4:5,]
# change row and column names of output
colnames(bca.ci) <- colnames(perc.ci) <- names(coef(mod))
rownames(bca.ci) <- rownames(perc.ci) <- c("lower", "upper")
# print output
perc.ci

##       (Intercept)         gini   democrat gini:democrat
## lower   0.9373337 -0.007712521 -0.6668725   0.004825119
## upper   1.1586946 -0.001223048 -0.2716579   0.015987613

bca.ci

##       (Intercept)         gini   democrat gini:democrat
## lower   0.9598786 -0.008503679 -0.6943984   0.005732083
## upper   1.1854891 -0.001937200 -0.3025729   0.016770086

What if we Retained the Outliers?

[Figure omitted]

Bootstrap Distribution

n <- c("Intercept", "GINI", "Democrat", "Gini*Democrat")
par(mfrow=c(2,2))
for(j in 1:4){
  plot(density(boot.reg1$t[,j]), xlab="T*", ylab="Density", main=n[j])
  lines(density(boot.reg2$t[,j]), lty=2)
}


The Bootstrap Distribution

[Figure: 2 x 2 panel of bootstrap density estimates of T* for the Intercept, GINI, Democrat, and Gini*Democrat coefficients]

Jackknife-after-Bootstrap

• The jackknife-after-bootstrap provides a diagnostic of the bootstrap by allowing us to examine what would happen to the density if particular cases were deleted
• Given that we know that there are outliers, this diagnostic is important
• The jackknife-after-bootstrap plot is produced using the jack.after.boot function in the boot package. Once again, the index=2 argument asks for the results for the slope coefficient in the model

jack.after.boot(boot.reg1, index=2)

Jackknife-after-Bootstrap (2)

• The jack.after.boot plot is constructed as follows:
  • the horizontal axis represents the standardized jackknife value (the standardized value of the difference with that observation taken out)
  • The vertical axis represents various quantiles of the bootstrap statistic
  • Horizontal lines in the graph represent the bootstrap distribution at the various quantiles
  • Case numbers are labeled at the bottom of the graph so that each observation can be identified

R code for Jack After Boot

par(mfrow=c(2,2))
jack.after.boot(boot.reg1, index=1)
jack.after.boot(boot.reg1, index=2)
jack.after.boot(boot.reg1, index=3)
jack.after.boot(boot.reg1, index=4)

par(mfrow=c(2,2))
jack.after.boot(boot.reg2, index=1)
jack.after.boot(boot.reg2, index=2)
jack.after.boot(boot.reg2, index=3)
jack.after.boot(boot.reg2, index=4)

[Figure: jackknife-after-bootstrap plots for the four coefficients (boot.reg1 above, boot.reg2 below); horizontal axis: standardized jackknife value; vertical axis: the 5, 10, 16, 50, 84, 90, and 95% quantiles of T*, with case numbers labeled along the bottom]

Fixed versus Random

Why do the samples differ?
• Fixed-X resampling enforces the assumption that errors are randomly distributed by resampling the residuals from a common distribution
• As a result, if the model is not specified correctly (i.e., there is un-modeled nonlinearity, heteroskedasticity or outliers) these attributes do not carry over to the bootstrap samples
  • the effects of outliers were clear in the random-X case, but not with the fixed-X bootstrap

Bootstrapping the Loess Curve

We could also use bootstrapping to get the Loess Curve. First, let's try the fixed-X resampling:

library(foreign)
dat <- read.dta("http://quantoid.net/files/reg3/jacob.dta")
rescale01 <- function(x)(x - min(x, na.rm=T))/diff(range(x, na.rm=T))
seq.range <- function(x)seq(min(x), max(x), length=250)
dat$perot <- rescale01(dat$perot)
dat$chal <- rescale01(dat$chal_vote)
s <- seq.range(dat$perot)
df <- data.frame(perot = s)
lo.mod <- loess(chal ~ perot, data=dat, span=.75)
resids <- lo.mod$residuals
yhat <- lo.mod$fitted
lo.fun <- function(dat, inds){
  boot.e <- resids[inds]
  boot.y <- yhat + boot.e
  tmp.lo <- loess(boot.y ~ perot, data=dat, span=.75)
  preds <- predict(tmp.lo, newdata=df)
  preds
}
boot.perot1 <- boot(dat, lo.fun, R=1000)


Bootstrapping the Loess Curve (2)

Now, random-X resampling:

lo.fun2 <- function(dat, inds){
  tmp.lo <- loess(chal ~ perot, data=dat[inds,], span=.75)
  preds <- predict(tmp.lo, newdata=df)
  preds
}
boot.perot2 <- boot(dat, lo.fun2, R=1000)

Make Confidence Intervals

out1 <- sapply(1:250, function(i)boot.ci(boot.perot1, index=i, type="perc")$perc)
out2 <- sapply(1:250, function(i)boot.ci(boot.perot2, index=i, type="perc")$perc)
preds <- predict(lo.mod, newdata=df, se=T)
lower <- preds$fit - 1.96*preds$se
upper <- preds$fit + 1.96*preds$se
ci1 <- t(out1[4:5,])
ci2 <- t(out2[4:5,])
ci3 <- cbind(lower, upper)
plot(s, preds$fit, type="l", ylim=range(c(ci1, ci2, ci3)))
poly.x <- c(s, rev(s))
polygon(poly.x, y=c(ci1[,1], rev(ci1[,2])), col=rgb(1,0,0,.25), border=NA)
polygon(poly.x, y=c(ci2[,1], rev(ci2[,2])), col=rgb(0,1,0,.25), border=NA)
polygon(poly.x, y=c(ci3[,1], rev(ci3[,2])), col=rgb(0,0,1,.25), border=NA)
legend("topleft", c("Random-X", "Fixed-X", "Analytical"),
       fill=rgb(c(1,0,0), c(0,1,0), c(0,0,1), c(.25,.25,.25)), inset=.01)

Confidence Interval Graph

[Figure: the loess fit with the Random-X, Fixed-X, and Analytical 95% confidence bands overlaid]

When Might Bootstrapping Fail?

1. Incomplete Data: since we are trying to estimate a population distribution from the sample data, we must assume that the missing data are not problematic. It is acceptable, however, to use bootstrapping if multiple imputation is used beforehand
2. Dependent data: bootstrapping assumes independence when sampling with replacement. It should not be used if the data are dependent
3. Outliers and influential cases: if obvious outliers are found, they should be removed or corrected before performing the bootstrap (especially for random-X). We do not want the simulations to depend crucially on particular observations


Cross-Validation (1)

• A basic, but powerful, tool for model building
• Cross-validation is similar to bootstrapping in that it resamples from the original data
• The basic form involves randomly dividing the sample into two subsets:
  • The first subset of the data (screening sample) is used to select or estimate a statistical model
  • The second subset is then used to test the findings
• Can be helpful in avoiding capitalizing on chance and over-fitting the data - i.e., findings from the first subset may not always be confirmed by the second subset
• Cross-validation is often extended to use several subsets (either a preset number chosen by the researcher or leave-one-out cross-validation)

Cross-Validation (2)

• If no two observations have the same Y, a p-variable model fit to p + 1 observations will fit the data precisely
  • this implies, then, that discarding much of a dataset can lead to better fitting models
  • Of course, this will lead to biased estimates that are likely to give quite different predictions on another dataset (generated with the same DGP)
• Model validation allows us to assess whether the model is likely to predict accurately on future observations or observations not used to develop this model
• Three major factors that may lead to model failure are:
  1. Over-fitting
  2. Changes in measurement
  3. Changes in sampling
• External validation involves retesting the model on new data collected at a different point in time or from a different population
• Internal validation (or cross-validation) involves fitting and evaluating the model carefully using only one sample

Cross-Validation (3)

• The data are split into k subsets (usually 3 ≤ k ≤ 10), but can also use leave-one-out cross-validation
• Each of the subsets is left out in turn, with the regression run on the remaining data
• Prediction error is then calculated as the sum of the squared errors:

RSS = Σ (Y_i − Ŷ_i)²

• We choose the model with the smallest average "error":

MSE = Σ (Y_i − Ŷ_i)² / n

• We could also look to the model with the largest average R²

Cross-Validation (4)

How many observations should I leave out from each fit?
• There is no rule on how many cases to leave out, but Efron (1983) suggests that grouped cross-validation (with approximately 10% of the data left out each time) is better than leave-one-out cross-validation

Number of repetitions
• Harrell (2001:93) suggests that one may need to leave 1/10 of the sample out 200 times to get accurate estimates
• Cross-validation does not validate the complete sample
• External validation, on the other hand, validates the model on a new sample
• Of course, limitations in resources usually prohibit external validation in a single study
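The K-fold procedure above can be hand-coded in a few lines. The data and model below are simulated stand-ins; any lm/glm could take their place:

```r
## Hand-rolled 10-fold cross-validation estimate of prediction MSE.
set.seed(6)
n <- 100
x <- runif(n)
y <- 1 + 3 * x + rnorm(n)
K <- 10
fold <- sample(rep(1:K, length.out = n))        # random fold assignment
sq.err <- lapply(1:K, function(k) {
  fit <- lm(y ~ x, subset = fold != k)          # fit on the other folds
  pred <- predict(fit, newdata = data.frame(x = x[fold == k]))
  (y[fold == k] - pred)^2                       # errors on the left-out fold
})
mean(unlist(sq.err))                            # CV estimate of the MSE
```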


Cross-Validation in R

• Cross-validation is done easily using the cv.glm function in the boot package

dat <- read.csv("http://www.quantoid.net/files/reg3/weakliem.txt", header=T)
dat <- dat[-c(25,49), ]
mod1 <- glm(secpay ~ poly(gini, 3)*democrat, data=dat)
mod2 <- glm(secpay ~ gini*democrat, data=dat)
deltas <- NULL
for(i in 1:25){
  deltas <- rbind(deltas, c(
    cv.glm(dat, mod1, K=5)$delta,
    cv.glm(dat, mod2, K=5)$delta))
}
colMeans(deltas)

## [1] 0.008985782 0.008295404 0.005647222 0.005528784

Cross-validating Span in Loess

We could use cross-validation to tell us something about the span in our Loess model.
1. First, split the sample into K groups (usually 10).
2. For each of the k = 10 groups, estimate the model on the other 9 groups and get predictions for the omitted observations. Do this for each of the 10 subsets in turn.
3. Calculate the CV error: (1/n) Σ (y_i − ŷ_i)²
4. Potentially, do this lots of times and average across the CV error.

set.seed(1)
n <- 400
x <- 0:(n-1)/(n-1)
f <- 0.2*x^11*(10*(1-x))^6+10*(10*x)^3*(1-x)^10
y <- f + rnorm(n, 0, sd = 2)
tmp <- data.frame(y=y, x=x)
lo.mod <- loess(y ~ x, data=tmp, span=.75)

55 / 64 56 / 64

Minimizing CV Criterion Directly

best.span <- optimize(cv.lo2, c(.05,.95), form=y ~ x, data=tmp,
                      numiter=5, K=10)
best.span

## $minimum
## [1] 0.2271807
##
## $objective
## [1] 3.912508

There is also a canned function in fANCOVA that optimizes the span via AICc or GCV.

library(fANCOVA)
best.span2 <- loess.as(tmp$x, tmp$y, criterion="gcv")
best.span2$pars$span

## [1] 0.1483344

The Curve

[Figure: the simulated (x, y) data with two fitted loess curves overlaid, one using the CV-chosen span and one using the GCV-chosen span.]
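The cv.lo2 function passed to optimize above is defined on an earlier slide and not shown here; the sketch below is a guess at its logic, written only to match the arguments used in the call (span as the first argument, then form, data, numiter, and K), not a reproduction of the original:

```r
# Hypothetical reconstruction of cv.lo2: K-fold CV error of a loess fit
# for a given span, averaged over several random fold assignments.
cv.lo2 <- function(span, form, data, numiter = 5, K = 10) {
  n <- nrow(data)
  yname <- all.vars(form)[1]  # name of the response variable
  one.cv <- function() {
    folds <- sample(rep(1:K, length.out = n))
    sq.err <- numeric(n)
    for (j in 1:K) {
      # surface="direct" lets predict() extrapolate to held-out x values
      fit <- loess(form, data = data[folds != j, ], span = span,
                   control = loess.control(surface = "direct"))
      pred <- predict(fit, newdata = data[folds == j, ])
      sq.err[folds == j] <- (data[[yname]][folds == j] - pred)^2
    }
    mean(sq.err)
  }
  # average over numiter random splits to smooth out fold-assignment noise
  mean(replicate(numiter, one.cv()))
}

# e.g., on the simulated data from the previous slide:
# cv.lo2(.5, y ~ x, data = tmp, numiter = 5, K = 10)
```

Because each evaluation re-randomizes the folds, the resulting criterion is noisy, which is exactly why optimizing it can fail (as the next slide discusses).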

57 / 64 58 / 64

Manually Cross-Validating

• Sometimes, optimizing the cross-validation criterion fails.
• Randomness in the CV procedure can produce a function that has several local minima.
• You could force the same random split at every evaluation by hand-coding the CV, but this might not be the best idea.
• If optimization of the CV criterion fails, you could always do it manually.

Manually Cross-Validating in the Y-J Transform Example

cvoptim_yj <- function(pars, form, data, trans.vars, K=5, numiter=10){
  require(boot)
  require(VGAM)
  form <- as.character(form)
  for(i in 1:length(trans.vars)){
    form <- gsub(trans.vars[i], paste("yeo.johnson(", trans.vars[i],
                 ",", pars[i], ")", sep=""), form)
  }
  form <- as.formula(paste0(form[2], form[1], form[3]))
  m <- glm(as.formula(form), data, family=gaussian)
  d <- lapply(1:numiter, function(x)cv.glm(data, m, K=K))
  mean(sapply(d, function(x)x$delta[1]))
}
lams <- seq(0, 2, by=.1)
s <- sapply(lams, function(x)cvoptim_yj(x, form=prestige ~ income + education + women,
            data=Prestige, trans.vars="income", K=3))
plot(s ~ lams, type="l", xlab=expression(lambda), ylab="CV Error")

59 / 64 60 / 64

Figure

[Figure: CV error (vertical axis, roughly 55 to 75) plotted against λ (0 to 2) for the Yeo-Johnson transformation of income, the output of the plot() call above.]

Summary and Conclusions

• Resampling techniques are powerful tools for estimating standard errors from small samples or when the statistics we are using do not have easily determined standard errors
• Bootstrapping involves taking a "new" random sample (with replacement) from the original data
• We then calculate bootstrap standard errors and statistical tests from the empirical distribution of the statistic across the bootstrap samples
• Jackknife resampling takes new samples of the data by omitting each case in turn and recalculating the statistic each time
• Cross-validation randomly splits the sample into two groups, comparing the model results from one sample to the results from the other
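The jackknife mentioned in the summary can be written in a few lines of R. A minimal sketch (the function name and the example data are made up for illustration):

```r
# Jackknife standard error of a statistic: leave out each observation in
# turn, recompute the statistic, and use the spread of those n estimates.
jack.se <- function(x, stat = mean) {
  n <- length(x)
  theta.i <- sapply(1:n, function(i) stat(x[-i]))  # leave-one-out estimates
  sqrt(((n - 1) / n) * sum((theta.i - mean(theta.i))^2))
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
jack.se(x)  # for the mean, this reproduces the usual SE: sd(x)/sqrt(length(x))
```

For the sample mean the jackknife SE equals the familiar sd(x)/√n exactly; its appeal is that the same recipe applies to statistics (like the median or Cronbach's alpha) with no simple SE formula.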

61 / 64 62 / 64

Readings

Davison, Anthony C. and D.V. Hinkley. 1997. Bootstrap Methods and Their Applications. Cambridge: Cambridge University Press.
Efron, Bradley and Robert Tibshirani. 1993. An Introduction to the Bootstrap. New York: Chapman & Hall.
Fox, John. 2008. Applied Regression Analysis and Generalized Linear Models, 2nd edition. Thousand Oaks, CA: Sage, Inc.
Ronchetti, Elvezio, Christopher Field and Wade Blanchard. 1997. "Robust Linear Model Selection by Cross-validation." Journal of the American Statistical Association 92(439):1017–1023.
Stone, Mervyn. 1974. "Cross-validation and Assessment of Statistical Predictions." Journal of the Royal Statistical Society, Series B 36(2):111–147.

Today: Resampling and Cross-validation

⇤ Fox (2008) Chapter 21
⇤ Stone (1974)
Efron and Tibshirani (1993)
Davison and Hinkley (1997)
Ronchetti, Field and Blanchard (1997)

63 / 64 64 / 64