
Hypothesis Testing with the Bootstrap

Noa Haas, M.Sc. Seminar, Spring 2017, Bootstrap Methods

Bootstrap Hypothesis Testing

A bootstrap hypothesis test starts with a test statistic $t(\mathbf{x})$ (not necessarily an estimate of a parameter). We seek an achieved significance level

$$\mathrm{ASL} = \mathrm{Prob}_{H_0}\left\{ t(\mathbf{x}^*) \ge t(\mathbf{x}) \right\}$$

where $\mathbf{x}^*$ has the distribution specified by the null hypothesis $H_0$, denoted $F_0$. Bootstrap hypothesis testing uses a "plug-in" style estimate of $F_0$.

The Two-Sample Problem

We observe two independent random samples:

$$F \to \mathbf{z} = (z_1, z_2, \ldots, z_n) \quad \text{independently of} \quad G \to \mathbf{y} = (y_1, y_2, \ldots, y_m)$$

and we wish to test the null hypothesis of no difference between $F$ and $G$:

$$H_0: F = G$$

Bootstrap Hypothesis Testing for $F = G$

• Denote the combined sample by $\mathbf{x}$, and its empirical distribution by $\hat F_0$.
• Under $H_0$, $\hat F_0$ provides a nonparametric estimate of the common population that gave rise to both $\mathbf{z}$ and $\mathbf{y}$.

1. Draw $B$ samples of size $n + m$ with replacement from $\mathbf{x}$. Call the first $n$ observations $\mathbf{z}^*$ and the remaining $m$ observations $\mathbf{y}^*$.
2. Evaluate $t(\cdot)$ on each sample, giving $t(\mathbf{x}^{*b})$.
3. Approximate $\mathrm{ASL}_{\mathrm{boot}}$ by $\widehat{\mathrm{ASL}}_{\mathrm{boot}} = \#\left\{ t(\mathbf{x}^{*b}) \ge t(\mathbf{x}) \right\} / B$

* In the case that large values of $t(\mathbf{x}^{*b})$ are evidence against $H_0$.
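A minimal sketch of the three steps above in Python. The mouse-data values are the treatment and control survival times as reported in Efron and Tibshirani; they are included here as an illustrative assumption, and the function and variable names are ours:

```python
import numpy as np

def asl_boot_equal_dist(z, y, t_stat, B=1000, seed=0):
    """Bootstrap test of H0: F = G by resampling from the combined sample."""
    rng = np.random.default_rng(seed)
    x = np.concatenate([z, y])
    n = len(z)
    t_obs = t_stat(z, y)
    t_star = np.empty(B)
    for b in range(B):
        # Draw n + m observations with replacement from the pooled sample,
        # then split into z* (first n) and y* (the remaining m).
        xb = rng.choice(x, size=len(x), replace=True)
        t_star[b] = t_stat(xb[:n], xb[n:])
    return np.mean(t_star >= t_obs)   # ASL_boot = #{t(x*b) >= t(x)} / B

# Mouse data as reported in Efron and Tibshirani (assumed here):
z = np.array([94, 197, 16, 38, 99, 141, 23], dtype=float)           # treatment
y = np.array([52, 104, 146, 10, 51, 30, 40, 27, 46], dtype=float)   # control

diff_means = lambda z, y: z.mean() - y.mean()
print(round(diff_means(z, y), 2))   # observed statistic: 30.63
print(asl_boot_equal_dist(z, y, diff_means, B=1000))
```

Passing a studentized statistic as `t_stat` instead of the plain difference of means gives the approximate-pivot variant discussed on the next slide.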

Bootstrap Hypothesis Testing $F = G$ on the Mouse Data

A histogram of bootstrap replications of $t(\mathbf{x}) = \bar z - \bar y$ for testing $H_0: F = G$ on the mouse data. The proportion of values greater than 30.63 is .121. Calculating the studentized statistic

$$t(\mathbf{x}) = \frac{\bar z - \bar y}{\bar\sigma \sqrt{1/n + 1/m}}$$

(an approximate pivotal quantity) for the same replications produced $\widehat{\mathrm{ASL}}_{\mathrm{boot}} = .128$.

Testing Equality of Means

• Instead of testing $H_0: F = G$, we wish to test $H_0: \mu_z = \mu_y$, without assuming equal variances. We need estimates of $F$ and $G$ that use only the assumption of a common mean.

1. Define translated points $\tilde z_i = z_i - \bar z + \bar x$, $i = 1, \ldots, n$, and $\tilde y_i = y_i - \bar y + \bar x$, $i = 1, \ldots, m$. The empirical distributions of $\tilde{\mathbf{z}}$ and $\tilde{\mathbf{y}}$ share a common mean.
2. Draw $B$ bootstrap samples $(\mathbf{z}^*, \mathbf{y}^*)$ with replacement from $\tilde z_1, \ldots, \tilde z_n$ and $\tilde y_1, \ldots, \tilde y_m$ respectively.
3. Evaluate $t(\cdot)$ on each sample:
$$t(\mathbf{x}^{*b}) = \frac{\bar z^* - \bar y^*}{\sqrt{\bar\sigma_z^{*2}/n + \bar\sigma_y^{*2}/m}}$$
4. Approximate $\mathrm{ASL}_{\mathrm{boot}}$ by $\widehat{\mathrm{ASL}}_{\mathrm{boot}} = \#\left\{ t(\mathbf{x}^{*b}) \ge t(\mathbf{x}) \right\} / B$

Permutation Test vs. Bootstrap Hypothesis Testing

• Accuracy: In the two-sample problem, $\mathrm{ASL}_{\mathrm{perm}}$ is the exact probability of obtaining a test statistic as extreme as the one observed. In contrast, the bootstrap explicitly samples from an estimated probability mechanism; $\widehat{\mathrm{ASL}}_{\mathrm{boot}}$ has no interpretation as an exact probability.
• Flexibility: When the special symmetry that permutation testing requires does not hold, bootstrap testing can be applied much more generally than the permutation test (for example, in the two-sample problem the permutation test is limited to $H_0: F = G$, and similarly in the one-sample problem).

The One-Sample Problem

We observe a random sample:

$$F \to \mathbf{z} = (z_1, z_2, \ldots, z_n)$$

and we wish to test whether the mean of the population equals some predetermined value $\mu_0$:

$$H_0: \mu_z = \mu_0$$

Bootstrap Hypothesis Testing $\mu_z = \mu_0$

What is the appropriate way to estimate the null distribution? The empirical distribution $\hat F$ is not an appropriate estimate, because it does not obey $H_0$. As before, we can use the empirical distribution of the translated points

$$\tilde z_i = z_i - \bar z + \mu_0, \quad i = 1, \ldots, n,$$

which has mean $\mu_0$.

Bootstrap Hypothesis Testing $\mu_z = \mu_0$

The test will be based on the approximate distribution of the test statistic

$$t(\mathbf{z}) = \frac{\bar z - \mu_0}{\bar\sigma / \sqrt n}$$

We sample $\tilde z_1^*, \ldots, \tilde z_n^*$ with replacement from $\tilde z_1, \ldots, \tilde z_n$ a total of $B$ times, and for each sample compute

$$t(\mathbf{z}^{*b}) = \frac{\bar{\tilde z}^* - \mu_0}{\bar\sigma^* / \sqrt n}$$

The estimated ASL is given by $\widehat{\mathrm{ASL}}_{\mathrm{boot}} = \#\left\{ t(\mathbf{z}^{*b}) \ge t(\mathbf{z}) \right\} / B$

* In the case that large values of $t(\mathbf{z}^{*b})$ are evidence against $H_0$.
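A sketch of the one-sample procedure in Python. The helper name is ours, and the comparison is one-sided in the lower tail to match the mouse example that follows, where small values of $t$ are evidence against $H_0$:

```python
import numpy as np

def asl_boot_mean(z, mu0, B=1000, seed=0):
    """Bootstrap test of H0: mu_z = mu0.

    Counts t(z*) <= t(z), matching the mouse example where small values
    are evidence against H0; flip the comparison for the other tail."""
    rng = np.random.default_rng(seed)
    zt = z - z.mean() + mu0          # translated points: their mean is exactly mu0
    def t_stat(s):
        return (s.mean() - mu0) / (s.std(ddof=1) / np.sqrt(len(s)))
    t_obs = t_stat(z)
    t_star = np.array([t_stat(rng.choice(zt, size=len(zt), replace=True))
                       for _ in range(B)])
    return t_obs, np.mean(t_star <= t_obs)

# Mouse treatment-group values as reported in Efron and Tibshirani (assumed here):
z = np.array([94, 197, 16, 38, 99, 141, 23], dtype=float)
t_obs, asl = asl_boot_mean(z, mu0=129, B=1000)
print(round(t_obs, 2))   # -1.67
```

The `ddof=1` choice uses the unbiased variance estimate, consistent with the calculation on the next slide.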

Testing $\mu_z = \mu_0$ on the Mouse Data

Taking $\mu_0 = 129$, the observed value of the test statistic is

$$t(\mathbf{z}) = \frac{86.9 - 129}{66.8 / \sqrt 7} = -1.67$$

(estimating $\sigma$ with the unbiased estimate of the standard deviation). For 94 of 1000 bootstrap samples, $t(\mathbf{z}^*)$ was smaller than $-1.67$, and therefore $\widehat{\mathrm{ASL}}_{\mathrm{boot}} = .094$. For reference, Student's t-test for the same null hypothesis on these data gives

$$\mathrm{ASL} = \mathrm{Prob}\left\{ t_6 < -\frac{42.1}{66.8 / \sqrt 7} \right\} = 0.07$$

Testing Multimodality of a Population

A mode is defined to be a local maximum or "bump" of the population density.

The data: $x_1, \ldots, x_{485}$, the thicknesses of Mexican stamps from 1872. The number of modes is suggestive of the number of distinct types of paper used in the printing.

Testing Multimodality of a Population

Since the histogram is not smooth, it is difficult to tell from it whether there is more than one mode. A Gaussian kernel density estimate with window size $h$ can be used to obtain a smoother estimate:

$$\hat f(t; h) = \frac{1}{nh} \sum_{i=1}^{n} \phi\!\left( \frac{t - x_i}{h} \right)$$

As $h$ increases, the number of modes in the density estimate is non-increasing.

Testing Multimodality of a Population

The null hypothesis is $H_0$: number of modes $= 1$, versus the alternative: number of modes $> 1$. Since the number of modes decreases as $h$ increases, there is a smallest value of $h$ such that $\hat f(t; h)$ has one mode. Call it $\hat h_1$. In our case, $\hat h_1 \approx .0068$.

Testing Multimodality of a Population

It seems reasonable to use $\hat f(t; \hat h_1)$ as the estimated null distribution for our test of $H_0$. It is the density estimate that uses the least amount of smoothing among all estimates with one mode (a conservative choice). A small adjustment to $\hat f$ is needed, because the kernel smoothing artificially increases the variance of the estimate by $\hat h_1^2$. Let $\hat g(\cdot; \hat h_1)$ be the rescaled estimate, which imposes a variance equal to the sample variance.
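The search for $\hat h_1$ can be sketched as a grid-based mode count plus a bisection over $h$. The grid resolution and the bracketing interval below are implementation choices assumed here, not part of the original procedure:

```python
import numpy as np

def count_modes(x, h, grid_size=2000):
    """Number of local maxima of the Gaussian kernel density estimate
    f(t; h), evaluated on a fine grid covering the data."""
    t = np.linspace(x.min() - 3 * h, x.max() + 3 * h, grid_size)
    u = (t[:, None] - x[None, :]) / h
    f = np.exp(-0.5 * u**2).sum(axis=1) / (len(x) * h * np.sqrt(2 * np.pi))
    is_peak = (f[1:-1] > f[:-2]) & (f[1:-1] > f[2:])
    return int(is_peak.sum())

def smallest_h(x, k, iters=40):
    """Bisect for the smallest h at which f(t; h) has at most k modes;
    valid because the mode count is non-increasing in h for a Gaussian kernel."""
    lo, hi = 0.01 * x.std(), x.max() - x.min()   # bracketing interval (assumed)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if count_modes(x, mid) <= k:
            hi = mid
        else:
            lo = mid
    return hi
```

For the multimodality test, `smallest_h(x, 1)` plays the role of $\hat h_1$: a clearly bimodal sample yields a larger $\hat h_1$ than a unimodal one.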

A natural choice of test statistic is $\hat h_1$: a large value of $\hat h_1$ is evidence against $H_0$. Putting all of this together, the achieved significance level is

$$\mathrm{ASL}_{\mathrm{boot}} = \mathrm{Prob}_{\hat g(\cdot; \hat h_1)}\left\{ \hat h_1^* > \hat h_1 \right\}$$

where each bootstrap sample $\mathbf{x}^*$ is drawn from $\hat g(\cdot; \hat h_1)$.

Testing Multimodality of a Population

Sampling from $\hat g(\cdot; \hat h_1)$ is given by:

$$x_i^* = \bar x + \left( 1 + \hat h_1^2 / \hat\sigma^2 \right)^{-1/2} \left( y_i^* - \bar x + \hat h_1 \epsilon_i \right), \quad i = 1, \ldots, n$$

where $y_1^*, \ldots, y_n^*$ are sampled with replacement from $x_1, \ldots, x_n$, and the $\epsilon_i$ are standard normal random variables (this is called the smoothed bootstrap). In the stamps data, none of 500 bootstrap samples had $\hat h_1^* > .0068$, so $\widehat{\mathrm{ASL}}_{\mathrm{boot}} = 0$. The results can be interpreted in a sequential manner, moving on to higher values of the smallest number of modes. (Silverman 1981)
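Drawing one smoothed-bootstrap sample is a direct transcription of the formula above (variable names are ours; the plug-in variance is used for $\hat\sigma^2$ as an assumption):

```python
import numpy as np

def smoothed_bootstrap_sample(x, h1, rng):
    """One sample of size n from the rescaled one-mode density g(.; h1):
    x_i* = xbar + (1 + h1^2/sigma^2)^(-1/2) * (y_i* - xbar + h1 * eps_i)."""
    n = len(x)
    xbar, var = x.mean(), x.var()
    y_star = rng.choice(x, size=n, replace=True)   # ordinary bootstrap draw
    eps = rng.standard_normal(n)                   # Gaussian kernel noise
    return xbar + (y_star - xbar + h1 * eps) / np.sqrt(1.0 + h1**2 / var)
```

The full test repeats this $B$ times, recomputes $\hat h_1^*$ for each sample, and reports the fraction of samples with $\hat h_1^* > \hat h_1$. The rescaling by $(1 + \hat h_1^2/\hat\sigma^2)^{-1/2}$ is what keeps the bootstrap samples from having inflated variance.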

When testing the same data for $H_0$: number of modes $= 2$, 146 samples out of 500 had $\hat h_2^* > .0033$, which translates to $\widehat{\mathrm{ASL}}_{\mathrm{boot}} = 0.292$. In our case, the inference process ends here.

Summary

A bootstrap hypothesis test is carried out using the following ingredients:

a) A test statistic $t(\mathbf{x})$
b) An approximate null distribution $\hat F_0$ for the data under $H_0$

Given these, we generate $B$ bootstrap values of $t(\mathbf{x}^{*b})$ under $\hat F_0$ and estimate the achieved significance level by

$$\widehat{\mathrm{ASL}}_{\mathrm{boot}} = \#\left\{ t(\mathbf{x}^{*b}) \ge t(\mathbf{x}) \right\} / B$$

The choice of test statistic $t(\mathbf{x})$ and the estimate of the null distribution determine the power of the test. In the stamp example, if the actual population density is bimodal but the Gaussian kernel density does not approximate it accurately, then the suggested test will not have high power. Bootstrap tests are useful when the alternative hypothesis is not well specified. In cases where there is a parametric alternative hypothesis, likelihood or Bayesian methods might be preferable.