
Hypothesis Testing Using Randomization Distributions

T. Scofield
10/03/2016

Randomization Distributions in Two-Proportion Settings

By calling our setting a “two-proportion” one, I mean that the data frame has two binary categorical variables: one delineates which of two groups a subject comes from and serves as the explanatory variable, while the other, the response variable, also has just two outcomes. In the cocaine addiction data, we have an explanatory variable, “treatment”, which has three levels: “Desipramine”, “Lithium”, and “Placebo.” We cut that back to two by ignoring one set of patients, perhaps those receiving Desipramine, thereby giving us just two groups to consider. The response variable is “relapsed or not?”, which has just two values, “yes” or “no.” We focus on the proportion of relapsers. Natural hypotheses for a study to see if Lithium helps to decrease the chance of relapse are

H0 : pL − pP = 0, Ha : pL − pP < 0.

The sample proportions among the lithium and placebo groups are p̂L = 18/24 and p̂P = 20/24, giving us test statistic

p̂L − p̂P = 18/24 − 20/24 = −2/24 ≈ −0.083.

Like the study about tapping fingers under the influence of caffeine, this study is an experiment, where the treatment (Lithium or Placebo) was randomly assigned to patients. When we generate a randomization distribution, we want to be faithful to this process, even as we take the null hypothesis into account. That is, the mental image of dropping slips of paper into two bags, one bag containing the 48 relapse results (38 “yes” and 10 “no”) and the other containing the 48 treatments (24 “Lithiums” and 24 “Placebos”), and randomly assigning the latter to the former as we select our randomization sample, achieves both goals.

Generating a randomization distribution, however, is trickier in RStudio for this situation than in earlier scenarios, primarily because of the work we must do to prepare data for randomization samples. You may well prefer to use StatKey, the software meant to accompany the textbook, over RStudio for cases involving two proportions. I will, however, provide details in RStudio for your perusal. The main difficulty, as indicated above, is preparing data; two approaches follow, after a brief illustration of the two-bag image.
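The following base-R sketch (illustrative only; the object names are mine, not from the text) carries out the two-bag image once, producing a single randomization statistic.

# One pass through the "two bags" image, in base R.
# Bag 1: the 48 observed relapse results (38 "yes", 10 "no").
relapses   <- c(rep("yes", 38), rep("no", 10))
# Bag 2: the 48 treatment labels (24 of each).
treatments <- c(rep("Lithium", 24), rep("Placebo", 24))
# Randomly assign treatments to relapse results ...
counts <- table(sample(treatments), relapses)
# ... and compute one randomization statistic, phat_L - phat_P.
counts["Lithium", "yes"]/24 - counts["Placebo", "yes"]/24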

Approach 1: Recreate the data from scratch

We have done this sort of thing once before, back in Section 2.1. Perhaps you recall the commands.

part1 <- do(6)* data.frame(Drug="Lithium", Relapse="no")
part2 <- do(18)* data.frame(Drug="Lithium", Relapse="yes")
part3 <- do(4)* data.frame(Drug="Placebo", Relapse="no")
part4 <- do(20)* data.frame(Drug="Placebo", Relapse="yes")
addictTreatments <- rbind(part1, part2, part3, part4)
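A quick check that the rebuilt frame matches the study (this assumes the mosaic package is loaded, as in earlier handouts):

tally(Relapse ~ Drug, data=addictTreatments)   # expect 18 of 24 "yes" for Lithium, 20 of 24 for Placebo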

Approach 2: Filtering the supplied data frame

It turns out we don’t actually need to recreate the data, as it has been supplied to us as part of the Lock5withR package in a data frame called CocaineTreatment. But working with it is not so straightforward as it would at first seem, because this data frame contains all the patients, including those who received the drug called Desipramine. We can select the desired subset by leaving out these subjects:

myFilteredData <- subset(CocaineTreatment, Drug != "Desipramine")

However, there seems to be a lingering “memory” that there were three levels for the Drug variable. You see this, for instance, when you produce a table on Drug:

tally(~Drug, myFilteredData)

## Drug
## Desipramine     Lithium     Placebo
##           0          24          24

While the count of Desipramine patients is 0, we would prefer that our filtered data frame not know Desipramine is part of this study. One way to make it “forget” is to combine the removal of Desipramine patients with the droplevels() command.

myFilteredData <- droplevels(subset(CocaineTreatment, Drug != "Desipramine"))
tally(~Drug, myFilteredData)

## Drug
## Lithium Placebo
##      24      24

Now our Drug variable truly has just two levels in the myFilteredData data frame.
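If you prefer, re-applying factor() to the column after subsetting accomplishes the same forgetting; a one-line sketch, equivalent to droplevels() for this purpose:

myFilteredData$Drug <- factor(myFilteredData$Drug)   # rebuilds the factor, dropping the unused Desipramine level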

Once data has been prepared . . .

If you carried out the commands above, you now have two data frames, addictTreatments and myFilteredData, which can be used for our analysis. Either will work, but I will use myFilteredData.

head(myFilteredData)

##       Drug Relapse
## 25 Lithium      no
## 26 Lithium     yes
## 27 Lithium     yes
## 28 Lithium     yes
## 29 Lithium     yes
## 30 Lithium      no

We obtain our test statistic from the sample itself:

diff(prop(Relapse~Drug, data=myFilteredData))

## no.Placebo
## -0.08333333

As when dealing with the difference of two means (see the example using data from CaffeineTaps in a prior handout), our null hypothesis dictates that the drug received (Lithium vs. Placebo) is not actually a factor, and we should generate many randomization statistics by shuffling values of the explanatory variable. One randomization statistic is obtained with the command

diff(prop(Relapse~shuffle(Drug), data=myFilteredData))

## no.Placebo
##  0.1666667

and this may be repeated many times to obtain a randomization distribution:

manyDiffs <- do(5000)* diff(prop(Relapse~shuffle(Drug), data=myFilteredData))
head(manyDiffs)

##    no.Placebo
## 1  0.25000000
## 2 -0.16666667
## 3  0.25000000
## 4  0.08333333
## 5 -0.16666667
## 6  0.00000000

The column, containing 5000 randomization statistics, has been given the curious name no.Placebo. (The name comes from prop(), which reports the proportion falling in the first level of Relapse, “no”, within each Drug group; diff() kept the name of the Placebo entry. Since the “no” proportions are one minus the “yes” proportions, the difference it computes equals p̂L − p̂P.) We may view a histogram and mark the region corresponding to our P-value:

histogram(~no.Placebo, data=manyDiffs, groups = no.Placebo <= -0.083333, width=.1)

[Figure: density histogram of the 5000 no.Placebo randomization statistics, with values ≤ −0.0833 shaded in the left tail.]

nrow(subset(manyDiffs, no.Placebo <=-0.083333)) / 5000

## [1] 0.3488

This P-value, here approximately 0.35, represents the probability, in a world where Lithium does not help deter relapse into cocaine addiction, of obtaining a sample with a test statistic (difference in sample proportions) of −0.0833 or less. This P-value is not statistically significant under any of the usual significance levels α = 0.1, 0.05, or 0.01. In fact, such sample statistics would arise about 35% of the time, which makes our sample statistic appear consistent with the null hypothesis. We fail to reject the null hypothesis.
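As an aside, mosaic’s prop() can compute the same tail proportion in one step, sparing us the division by 5000; a sketch:

prop(~ (no.Placebo <= -0.083333), data=manyDiffs)   # proportion of randomization statistics in the left tail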

Example: Hypothesis Test for Positive Correlation (NFL Malevolence)

The hypotheses (explained in the text, Section 4.4):

H0 : ρ = 0, Ha : ρ > 0.
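It can help to look at the data before testing. A lattice scatterplot (lattice is loaded along with mosaic; the title is my own) shows the positive trend we are about to test:

xyplot(ZPenYds ~ NFL_Malevolence, data=MalevolentUniformsNFL,
       main="Penalty yards (z-scores) vs. uniform malevolence")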

The test statistic:

cor(ZPenYds ~ NFL_Malevolence, data=MalevolentUniformsNFL)

## [1] 0.429796

Generation of many randomization statistics:

manyCors <- do(5000)* cor(ZPenYds ~ shuffle(NFL_Malevolence), data=MalevolentUniformsNFL)
head(manyCors)

##           cor
## 1 -0.22396686
## 2 -0.39130305
## 3  0.06329420
## 4  0.11707616
## 5  0.19503326
## 6  0.09328136

histogram(~cor, data=manyCors, groups = cor >= 0.42979)

[Figure: density histogram of the 5000 cor randomization statistics, with values ≥ 0.42979 shaded in the right tail.]

The P-value:

nrow(subset(manyCors, cor >= 0.42979)) / 5000

## [1] 0.0108

In the case where the significance level α = 0.05, this result is statistically significant, and we would reject the null hypothesis in favor of the alternative, concluding that there is a positive correlation between uniform malevolence and penalty yards.
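As a cross-check only (this is the classical t-based test, not the randomization approach of this handout), base R’s cor.test() tests the same one-sided hypotheses; its P-value should land in the same neighborhood as our 0.0108:

with(MalevolentUniformsNFL,
     cor.test(NFL_Malevolence, ZPenYds, alternative="greater"))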

Example: Is the mean body temperature really 98.6°F?

The hypotheses:

H0 : µ = 98.6, Ha : µ ≠ 98.6.

The test statistic:

mean(~BodyTemp, data=BodyTemp50)

## [1] 98.26

The natural thing would be to simulate the bootstrap distribution for x̄, as when we constructed a confidence interval for the population mean µ:

manyMeans = do(5000)* mean(~BodyTemp, data=resample(BodyTemp50))
head(manyMeans)

##     mean
## 1 98.332
## 2 98.190

## 3 98.250
## 4 98.280
## 5 98.206
## 6 98.280

histogram(~mean, data=manyMeans)

[Figure: density histogram of the 5000 bootstrap means, centered near 98.26.]

But this cannot be a proper simulation of the null distribution, as it is not centered at the right place. It appears the center is about 98.26, the value of our point estimate x̄, not at the hypothesized (population) mean of 98.6; this is what happens whenever we bootstrap a mean. Our randomization statistics should not be the same as bootstrap statistics here, but need to be modified so that they are centered on the proposed mean 98.6. The modification can simply be that we add to each of our sample means the difference between the intended center (98.6) and where they were centered above (at the sample mean x̄ = 98.26); that is, we should add 98.6 − 98.26 = 0.34:

manyMeans = do(5000)*( mean(~BodyTemp, data=resample(BodyTemp50)) + 0.34)
names(manyMeans)

## [1] "result" histogram(~result, data=manyMeans, groups = abs(result-98.6)>=0.34)

[Figure: density histogram of the recentered means (result), centered near 98.6, with both tails at least 0.34 from 98.6 shaded.]

We see that the modified statistics form a randomization distribution centered where it ought to be if it is to serve as the null distribution. We have attempted to shade those regions in both tails corresponding to randomization statistics at least as extreme as ours, though there are very few. We obtain the approximate P-value by calculating the area in one tail and doubling it:

nrow(subset(manyMeans, result <= 98.26)) * 2 / 5000

## [1] 0.002

Given this small P-value, we reject the null hypothesis and conclude that the actual (population) mean body temperature is something other than 98.6.
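An equivalent way to build this null distribution, if the add-0.34-afterward step feels awkward, is to shift the data once, before resampling, so the shifted sample has mean exactly 98.6; a sketch (the object names here are mine):

# Shift all 50 temperatures so the sample mean becomes the hypothesized 98.6 ...
shiftedTemps <- BodyTemp50$BodyTemp + (98.6 - mean(BodyTemp50$BodyTemp))
# ... then bootstrap means of the shifted values; these are centered at 98.6.
manyMeans2 = do(5000)* mean(resample(shiftedTemps))

The resulting distribution plays the same role as result above.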

Example 4.34: A New Wrinkle on Finger Tapping and Caffeine

This example has already been done adequately. Since it was a controlled, randomized experiment in which one treatment, either caffeine or placebo, was assigned randomly to each subject, we obtained our randomization distribution in a manner that also randomly assigned treatment values while adhering to the null hypothesis that “treatment doesn’t matter.” We obtained one randomization statistic with the command

diff(mean(Taps ~ shuffle(Caffeine), data=CaffeineTaps))

and an entire distribution of such statistics by repeating this command often. Example 4.34 challenges us to imagine different ways of studying the question: “Does caffeine increase tapping rates?” Surely there are other approaches besides a controlled randomized experiment. The Locks have us consider two different studies one might undertake.

1. An observational study: Instead of assigning treatments, we find subjects who have already self-selected their own treatments, some having had caffeine (probably as part of a daily routine, drinking coffee in the morning), and others who have not. Subjects from both groups have their tap rates measured, and results of both variables are again recorded.

2. A matched pairs study: This time, subjects undergo both treatments, having their tap rates measured under each. The order of the treatments is assigned randomly, so that some receive caffeine first, while for others it is the placebo first. Each subject is, then, the source of two numbers, the “caffeine tap rate” and the “placebo tap rate.” Our effective data for each subject, however, would be the difference: (caffeine tap rate) − (placebo tap rate).

In each of these scenarios, the change in the manner in which data are collected calls for a change in the manner in which randomization statistics are produced. The easier of these two alternate study paradigms to handle in RStudio is the matched pairs case, which we discuss next. We will not delve into the observational study case, but suffice it to say that our treatment should be something like the approach suggested by Hunter Pham (see earlier course notes), but modified so that the null hypothesis is respected.

So, imagine that we have gathered a random sample of 10 people for a matched pairs study on whether caffeine causes higher tapping rates. We randomly select 5 to undergo the caffeine treatment first followed by placebo, while the other 5 will receive placebo first and then caffeine. (For a blind study, which is preferred, subjects still do not know which treatment they receive first.) Displayed below are some pretend data from a matched pairs experiment. This data frame, matchedPairsCaffTaps, is not part of any package you can load; the commands that generate it are given here.

set.seed(50)
matchedPairsCaffTaps = data.frame(placebo=round(runif(10,234,255),1),
                                  caffeine=round(runif(10,241,258),1),
                                  first=sample(c(rep("C",5),rep("P",5))))
matchedPairsCaffTaps$obsDiff = matchedPairsCaffTaps$caffeine - matchedPairsCaffTaps$placebo

The resulting data set is displayed here.

matchedPairsCaffTaps

##    placebo caffeine first obsDiff
## 1    248.9    247.6     P    -1.3
## 2    243.2    245.6     C     2.4
## 3    238.2    251.9     P    13.7
## 4    250.1    242.3     P    -7.8
## 5    244.8    245.7     P     0.9
## 6    234.9    252.5     C    17.6
## 7    248.7    255.2     C     6.5
## 8    247.6    247.2     C    -0.4
## 9    234.9    242.3     C     7.4
## 10   236.3    243.9     P     7.6

The null and alternative hypotheses, which should be understood before data have been collected, are these:

H0 : µDiff = 0, Ha : µDiff > 0.

From our data, we obtain the sample mean of observed differences in the usual way.

mean(~obsDiff, data=matchedPairsCaffTaps)

## [1] 4.66

This is our test statistic. In generating randomization statistics, we want to use the data we have, but adhere to the null hypothesis, which implies that caffeine should not dictate which tap rate, the one under caffeine or the one under placebo, is larger. This would mean that the sign of each difference is random, coming out positive or negative like flips of a coin come out heads or tails. The command

sample(c(-1,1), 10, replace=TRUE)

## [1] -1 -1 -1  1 -1  1 -1  1 -1 -1

acts like 10 coin flips, except that it produces −1 and 1 rather than “H” or “T”. We simulate one randomization statistic by this command

mean(~ obsDiff*sample(c(-1,1), 10, replace=TRUE), data=matchedPairsCaffTaps)

## [1] 3.64

and obtain a randomization distribution by repeating it multiple times:

manyMPMeans = do(3000)* mean(~obsDiff*sample(c(-1,1),10,replace=TRUE), data=matchedPairsCaffTaps)
head(manyMPMeans)

##     mean
## 1 -3.18
## 2  2.70
## 3 -1.82
## 4 -1.82
## 5  3.22
## 6 -3.48

histogram(~mean, manyMPMeans, groups = mean >= 4.66)

[Figure: density histogram of the 3000 matched-pairs randomization means, with values ≥ 4.66 shaded in the right tail.]

Our approximate P-value is

nrow(subset(manyMPMeans, mean >= 4.66)) / 3000

## [1] 0.04566667

Since this P-value falls (just barely) below α = 0.05, we would reject the null hypothesis at the 5% level, concluding that caffeine increases the mean tap rate; at α = 0.01 we would not.
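Incidentally, the random signs above could have come from rbinom() just as well as from sample(); an equivalent coin-flipper, for those who prefer it:

2*rbinom(10, size=1, prob=0.5) - 1   # ten random values from {-1, 1}, like 10 coin flips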
