
Chapter 5 Multiple Comparisons

Yibi Huang

Why Worry About Multiple Comparisons? • Familywise Error Rate • Simultaneous Confidence Intervals • Bonferroni's Method • Tukey-Kramer Procedure for Pairwise Comparisons • Dunnett's Procedure for Comparing with a Control • Scheffé's Method for Comparing All Contrasts

Why Worry About Multiple Comparisons?

Recall that, at level α = 0.05, a hypothesis test will make a Type I error 5% of the time.

• Type I error = H0 being falsely rejected when it is true

What if we conduct multiple hypothesis tests?

• When 100 H0's are tested at the 0.05 level, even if all the H0's are true, it is normal to have about 5 of them rejected.

• When multiple tests are done, it is very likely that some significant results are NOT TRUE FINDINGS. The significance criterion must be adjusted.
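The inflation is easy to quantify when the tests are independent: the chance of at least one false rejection is 1 − (1 − α)^k. A minimal sketch (the independence assumption is ours, not the slides'):

```python
# Chance of at least one false rejection among k independent level-alpha tests,
# assuming all k null hypotheses are true
def fwer_independent(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

for k in (1, 5, 10, 100):
    print(k, round(fwer_independent(k), 4))
```

With 100 true nulls tested at level 0.05, the chance of at least one false rejection exceeds 99%, which is why the significance criterion must be adjusted.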

Why Worry About Multiple Comparisons?

• In an experiment, when the ANOVA F-test is rejected, we will attempt to compare ALL pairs of treatments, as well as contrasts, to find treatments that are different from others. For an experiment with g treatments, there are
  • C(g, 2) = g(g−1)/2 pairwise comparisons to make, and
  • numerous contrasts.

• It is not fair to hunt around through the data for a big contrast and then pretend that you have done only one comparison. This is called data snooping. The significance of such a finding is often overstated.

5.1 Familywise Error Rate (FWER)

Given a single null hypothesis H0,

• recall that a Type I error occurs when H0 is true but is rejected;
• the level (or size, or Type I error rate) of a test is the chance of making a Type I error.

Given a family of null hypotheses H01,H02,...,H0k ,

• a familywise Type I error occurs if H01, H02, ..., H0k are all true but at least one of them is rejected;

• the familywise error rate (FWER), also called the experimentwise error rate, is defined as the chance of making a familywise Type I error:

  FWER = P(at least one of H01, ..., H0k is falsely rejected)

• FWER depends on the family. The larger the family, the larger the FWER.

Simultaneous Confidence Intervals

Similarly, a 95% confidence interval (L, U) for a parameter θ may fail to cover θ 5% of the time.

If we construct multiple 95% confidence intervals (L1, U1), (L2, U2), ..., (Lk, Uk) for several different parameters θ1, θ2, ..., θk, the chance that at least one of the intervals fails to cover its parameter is (a lot) more than 5%.

Simultaneous Confidence Intervals

Given a family of parameters θ1, θ2, ..., θk, a set of 100(1−α)% simultaneous confidence intervals is a family of intervals

  {(L1, U1), (L2, U2), ..., (Lk, Uk)}

such that

  P(Li ≤ θi ≤ Ui for all i) ≥ 1 − α.

Note here that the Li's and Ui's are random variables that depend on the data.
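The coverage loss is concrete even in the simplest setting: for k independent 95% intervals the joint coverage is 0.95^k, while widening each interval to confidence level 1 − α/k restores joint coverage of at least 1 − α. A sketch (the independence assumption is ours, made only to keep the arithmetic exact):

```python
# Joint coverage of k = 6 independent confidence intervals
k = 6
joint_unadjusted = 0.95 ** k           # each interval at the usual 95% level
joint_adjusted = (1 - 0.05 / k) ** k   # each interval widened to level 1 - alpha/k
print(round(joint_unadjusted, 3), round(joint_adjusted, 3))
```

Unadjusted, the six intervals cover all six parameters only about 73.5% of the time; the widened intervals cover all six at least 95% of the time.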

Multiple Comparisons

To account for the fact that we are actually doing multiple comparisons, we need to make our C.I.'s wider, and the critical values larger, to ensure the chance of making any false rejection is < α. We will introduce several multiple comparison methods. All of them produce simultaneous C.I.'s of the form

  estimate ± (critical value) × (SE of the estimate)

and reject H0 when

  |t0| = |estimate| / (SE of the estimate) > critical value.

Here the "estimates" and "SEs" are the same as in the usual t-tests and t-intervals. Only the critical values vary with the methods, as summarized in the Summary of Multiple Comparison Adjustments later in this chapter.

5.2 Bonferroni's Method

Given that H01, ..., H0k are all true, by Bonferroni's inequality we know

  FWER = P(at least one of H01, ..., H0k is rejected)
       ≤ Σ_{i=1}^k P(H0i is rejected)        ← Type I error rate for H0i

If the Type I error rate for each of the k nulls can be controlled at α/k, then

  FWER ≤ Σ_{i=1}^k α/k = α.

• Bonferroni's method rejects a null if its comparisonwise P-value is less than α/k
• Bonferroni's method works OK when k is small
• When k > 10, Bonferroni starts to get more conservative than necessary. The actual FWER can be much less than α.

Recall the Beet Lice Study in Chapter 4

• Goal: efficacy of 4 chemical treatments for beet lice

• 100 beet plants in individual pots in total, 25 plants per treatment, randomly assigned

• Response: # of lice on each plant at the end of the 2nd week

• The pots are spatially separated

[Boxplot of the number of lice by treatment A, B, C, D]

  Chemical    A      B      C      D
  ȳi•       12.00  14.96  18.36  24.00        MSE = 47.84

The SE for a pairwise comparison is

  SE = √(MSE (1/ni + 1/nj)) = √(47.84 × (1/25 + 1/25)) = 1.956

Example — Beet Lice (Bonferroni's Method)

Recall the pairwise comparisons for the beet lice example.

  Comparison   Estimate    SE     t-value   p-value of t-test
  µB − µA        2.96     1.956    1.513    0.13356
  µC − µA        6.36     1.956    3.251    0.00159        < 0.0083
  µD − µA       12.00     1.956    6.134    1.91 × 10⁻⁸    < 0.0083
  µC − µB        3.40     1.956    1.738    0.0854
  µD − µB        9.04     1.956    4.621    1.19 × 10⁻⁵    < 0.0083
  µD − µC        5.64     1.956    2.883    0.00486        < 0.0083

There are k = 6 tests. For α = 0.05, instead of rejecting a null when the P-value < α, Bonferroni's method rejects when the P-value < α/k = 0.05/6 ≈ 0.0083.

Only A-C, A-D, B-D, C-D are significantly different.
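The Bonferroni rule on these six P-values can be checked in a few lines. A sketch (Python rather than the slides' R; the P-values are copied from the table above):

```python
# Bonferroni: reject a null only if its P-value < alpha / (number of tests)
pvals = {"A-B": 0.13356, "A-C": 0.00159, "A-D": 1.91e-8,
         "B-C": 0.0854,  "B-D": 1.19e-5, "C-D": 0.00486}
alpha = 0.05
cutoff = alpha / len(pvals)                        # 0.05 / 6 ≈ 0.0083
rejected = sorted(p for p in pvals if pvals[p] < cutoff)
print(rejected)
```

This reproduces the conclusion above: only the A-C, A-D, B-D, and C-D comparisons survive the adjusted cutoff.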

Example — Beet Lice (Bonferroni's Method)

Alternatively, to be significant at FWER = α based on Bonferroni's correction, the t-statistic for a pairwise comparison must be at least

  |t| = |ȳi• − ȳj•| / SE > t_{N−g, α/(2k)}

where k = 6 since there are C(4, 2) = 6 pairs to compare. Recall N − g = 100 − 4 = 96, so the critical value is

  t_{N−g, α/(2k)} = t_{96, 0.05/2/6} ≈ 2.694.

> qt(df=96, 0.05/2/6, lower.tail=F)
[1] 2.694028

So a pair of treatments i and j are significantly different at FWER = 0.05 iff

  |ȳi• − ȳj•| > SE × t_{N−g, α/(2k)} ≈ 1.956 × 2.694 ≈ 5.27.

  Chemical    A      B      C      D
  ȳi•       12.00  14.96  18.36  24.00

5.4 Tukey-Kramer Procedure for Pairwise Comparisons

• Family: ALL PAIRWISE COMPARISONS µi − µk
• For a balanced design (n1 = ... = ng = n), observe that

  |t0| = |ȳi• − ȳk•| / √(MSE (1/n + 1/n)) ≤ (ȳmax − ȳmin) / √(2 MSE/n) = q/√2,

in which q = (ȳmax − ȳmin) / √(MSE/n) has a studentized range distribution.

• The critical values qα(g, N−g) for the studentized range distribution can be found on p.633-634, Table D.8 in the textbook

• Controls the (strong) FWER exactly at α for balanced designs (n1 = ... = ng); approximately at α for unbalanced designs

Tukey-Kramer Procedure for All Pairwise Comparisons

For all 1 ≤ i ≠ k ≤ g, the 100(1−α)% Tukey-Kramer simultaneous C.I. for µi − µk is

  ȳi• − ȳk• ± (qα(g, N−g)/√2) × SE(ȳi• − ȳk•)

For H0: µi − µk = 0 vs. Ha: µi − µk ≠ 0, reject H0 if

  |t0| = |ȳi• − ȳk•| / SE(ȳi• − ȳk•) > qα(g, N−g)/√2

In both the C.I. and the test,

  SE(ȳi• − ȳk•) = √(MSE (1/ni + 1/nk)).
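The same interval can be sketched in Python with scipy's studentized range distribution (an assumption on our part: scipy ≥ 1.7 provides `scipy.stats.studentized_range`; shown for the D vs. A comparison in the beet lice data):

```python
import math
from scipy import stats

# Beet lice layout: g = 4 treatments, N = 100 plants, n = 25 per treatment
g, N, n, mse = 4, 100, 25, 47.84
se = math.sqrt(mse * (1 / n + 1 / n))                              # ≈ 1.956
crit = stats.studentized_range.ppf(0.95, g, N - g) / math.sqrt(2)  # ≈ 2.6146
diff = 24.00 - 12.00                                               # ȳD - ȳA
lo, hi = diff - crit * se, diff + crit * se
print(round(lo, 2), round(hi, 2))
```

The resulting interval agrees with the D-A row of R's `TukeyHSD` output shown two slides below (about 6.88 to 17.12).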

Tukey's HSD

To be significant at FWER = α based on Tukey's correction, the mean difference between a pair of treatments i and k must be at least

  (qα(g, N−g)/√2) × √(MSE (1/ni + 1/nk))

This is called Tukey's Honest Significant Difference (Tukey's HSD).

R command to find qα(a, f): qtukey(1-alpha, a, f)

> qtukey(0.95, 4, 96)/sqrt(2)
[1] 2.614607

For the Beet Lice example, Tukey's HSD is 2.6146 × 1.956 ≈ 5.114.

  Chemical    A      B      C      D
  ȳi•       12.00  14.96  18.36  24.00

Tukey's HSD in R

The TukeyHSD function only works on aov() models, not lm() models.

> aov1 = aov(licecount ~ ttt, data = beet)
> TukeyHSD(aov1)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = licecount ~ ttt, data = beet)

$ttt
     diff         lwr       upr     p adj
B-C -3.40  -8.5150601   1.71506 0.3099554
A-C -6.36 -11.4750601  -1.24494 0.0084911
D-C  5.64   0.5249399  10.75506 0.0246810
A-B -2.96  -8.0750601   2.15506 0.4337634
D-B  9.04   3.9249399  14.15506 0.0000695
D-A 12.00   6.8849399  17.11506 0.0000001

5.5.1 Dunnett's Procedure for Comparing with a Control

• Family: comparing ALL TREATMENTS with a CONTROL, µi − µctrl, where µctrl is the mean of the control group
• Controls the (strong) FWER exactly at α for balanced designs (n1 = ... = ng); approximately at α for unbalanced designs
• Less conservative and greater power than Tukey-Kramer
• The 100(1−α)% Dunnett simultaneous C.I. for µi − µctrl is

  ȳi• − ȳctrl• ± dα(g−1, N−g) × √(MSE (1/ni + 1/nctrl))

• For H0: µi − µctrl = 0 vs. Ha: µi − µctrl ≠ 0, reject H0 if

  |t0| = |ȳi• − ȳctrl•| / √(MSE (1/ni + 1/nctrl)) > dα(g−1, N−g)

• The critical values dα(g−1, N−g) can be found in Table D.9, p.635-638, of the textbook

5.3 Scheffé's Method for Comparing All Contrasts

Suppose there are g treatments in total. Consider a contrast C = Σ_{i=1}^g ωi µi. Recall

  Ĉ = Σ_{i=1}^g ωi ȳi•,    SE(Ĉ) = √(MSE × Σ_{i=1}^g ωi²/ni)

• The 100(1−α)% Scheffé simultaneous C.I. for all contrasts C is

  Ĉ ± √((g−1) F_{α, g−1, N−g}) × SE(Ĉ)

• For testing H0: C = 0 vs. Ha: C ≠ 0, reject H0 when

  |t0| = |Ĉ| / SE(Ĉ) > √((g−1) F_{α, g−1, N−g})
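The Scheffé critical value is just a transformed F quantile, so it is a one-liner to compute. A sketch for the beet lice dimensions (g = 4, N = 100; scipy assumed available):

```python
import math
from scipy import stats

g, N, alpha = 4, 100, 0.05
# Scheffe critical value: sqrt((g-1) * F_{alpha, g-1, N-g})
crit = math.sqrt((g - 1) * stats.f.ppf(1 - alpha, g - 1, N - g))
print(round(crit, 3))  # ≈ 2.846
```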

Scheffé's Method for Comparing All Contrasts

• Most conservative (least powerful) of all the procedures. Protects against data snooping!

• Controls the (strong) FWER at α, where the family is ALL POSSIBLE CONTRASTS

• Should be used if you have not planned your contrasts in advance.

Proof of Scheffé's Method (1)

Because Σ_{i=1}^g ωi = 0, observe that

  Ĉ = Σ_{i=1}^g ωi ȳi• = Σ_{i=1}^g ωi (ȳi• − ȳ••).

By the Cauchy-Schwarz inequality |Σ ai bi| ≤ √(Σ ai²) √(Σ bi²), letting ai = ωi/√ni and bi = √ni (ȳi• − ȳ••), we get

  |Ĉ| = |Σ_{i=1}^g ωi (ȳi• − ȳ••)| ≤ √(Σ_{i=1}^g ωi²/ni) × √(Σ_{i=1}^g ni (ȳi• − ȳ••)²)

Recalling that SSTrt = Σ_{i=1}^g ni (ȳi• − ȳ••)², we get the inequality

  |Ĉ| ≤ √(Σ_{i=1}^g ωi²/ni) × √SSTrt.

Proof of Scheffé's Method (2)

Recall the t-statistic for testing H0: C = 0 is t0(C) = Ĉ / SE(Ĉ). Using the inequality |Ĉ| ≤ √(Σ_{i=1}^g ωi²/ni) √SSTrt proved on the previous page, we have

  |t0(C)| = |Ĉ| / √(MSE Σ_{i=1}^g ωi²/ni)
          ≤ √(Σ_{i=1}^g ωi²/ni) √SSTrt / √(MSE Σ_{i=1}^g ωi²/ni)
          = √(SSTrt / MSE)

Recalling that F = MSTrt/MSE is the ANOVA F-statistic, we have

  |t0(C)| ≤ √(SSTrt / MSE) = √((g−1) MSTrt / MSE) = √((g−1) F).

We thus get a uniform upper bound for the t-statistic of any contrast C:

  |t0(C)| ≤ √((g−1) F).

Proof of Scheffé's Method (3)

Recall that F has an F-distribution with g−1 and N−g degrees of freedom, so P(F > F_{α, g−1, N−g}) = α. Since |t0(C)| ≤ √((g−1) F), we can see that

  FWER = P(|t0(C)| > √((g−1) F_{α, g−1, N−g}) for some contrast C)
       ≤ P(√((g−1) F) > √((g−1) F_{α, g−1, N−g}))
       = P(F > F_{α, g−1, N−g}) = α.
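The key inequality |t0(C)| ≤ √((g−1)F) can also be checked numerically: simulate one null data set and verify that no contrast's t-statistic exceeds the bound. A stdlib-only sketch (the layout g = 4, n = 25 mirrors the beet lice example, but the data and the contrasts here are simulated, not the study's):

```python
import math
import random

random.seed(0)
g, n = 4, 25                          # 4 treatments, 25 units each; all H0 true
data = [[random.gauss(0, 1) for _ in range(n)] for _ in range(g)]

ybar = [sum(row) / n for row in data]
grand = sum(map(sum, data)) / (g * n)
sstrt = n * sum((m - grand) ** 2 for m in ybar)
sse = sum((x - m) ** 2 for row, m in zip(data, ybar) for x in row)
mse = sse / (g * n - g)
F = (sstrt / (g - 1)) / mse
bound = math.sqrt((g - 1) * F)        # Scheffe's uniform bound sqrt((g-1)F)

# |t0(C)| for many random contrasts never exceeds the bound
max_t = 0.0
for _ in range(1000):
    w = [random.gauss(0, 1) for _ in range(g)]
    mean_w = sum(w) / g
    w = [wi - mean_w for wi in w]     # center the weights: a valid contrast
    c_hat = sum(wi * m for wi, m in zip(w, ybar))
    se = math.sqrt(mse * sum(wi * wi / n for wi in w))
    max_t = max(max_t, abs(c_hat) / se)

print(f"bound = {bound:.3f}, largest |t0(C)| found = {max_t:.3f}")
```

No matter how hard the loop "snoops" for a big contrast, its t-statistic stays below √((g−1)F), which is exactly why comparing against √((g−1)F_{α,g−1,N−g}) controls the FWER.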

Example — Beet Lice

Recall in Ch4, we tested a contrast comparing treatments A, B, C (all liquid) with treatment D (powder):

  Chemical    A     B      C      D
  ȳi•        12   14.96  18.36   24        MSE = 47.8

  C = (µA + µB + µC)/3 − µD,

which is estimated by

  Ĉ = (ȳA• + ȳB• + ȳC•)/3 − ȳD• = (12 + 14.96 + 18.36)/3 − 24 = −8.893

with standard error

  SE = √(MSE Σ_{i=1}^g ωi²/ni)
     = √(47.8 × ((1/3)²/25 + (1/3)²/25 + (1/3)²/25 + (−1)²/25))
     = √(47.8 × 4/75) = 1.597.

t-statistic: t0 = Ĉ/SE = −8.893/1.597 = −5.568, with df = 100 − 4 = 96.

Example — Beet Lice

With Scheffé's method, the critical value controlling FWER at 0.05 is

  √((g−1) F_{α, g−1, N−g}) = √((4−1) F_{0.05, 3, 96}) = √((4−1) × 2.699) ≈ 2.846

> qf(0.05, df1=3, df2=96, lower.tail=F)
[1] 2.699393
> sqrt((4-1)*qf(0.05, df1=3, df2=96, lower.tail=F))
[1] 2.84573

The critical value 2.846 for Scheffé's method means that if all treatments are equal, the contrast with the greatest t-statistic will exceed 2.846 only 5% of the time. The magnitude of the t-statistic for the contrast we considered, −5.568, is far above the critical value 2.846.

Conclusion: We can be certain that the contrast is really significant, even if the contrast was suggested by data snooping.

5.4.7 Fisher's Least Significant Difference (LSD)

• The least significant difference (LSD) is the minimum amount by which two means must differ in order to be considered statistically different.

• LSD = the usual t-tests and t-intervals; NO adjustment is made for multiple comparisons

• Least conservative (most likely to reject) among all the procedures; the FWER can be large when the family of tests is large

• Too liberal, but has greater power (more likely to reject)
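A quick check of what the unadjusted rule does on the beet lice data, using the critical value t_{0.025,96} ≈ 1.985 and SE ≈ 1.956 from the earlier slides (a sketch with those rounded values hard-coded):

```python
# Fisher's LSD half-width: (unadjusted t critical value) x SE
t_crit, se = 1.984984, 1.956
lsd = t_crit * se                        # ≈ 3.883, matching the summary table
means = {"A": 12.00, "B": 14.96, "C": 18.36, "D": 24.00}
pairs = [("A", "B"), ("A", "C"), ("A", "D"),
         ("B", "C"), ("B", "D"), ("C", "D")]
sig = [f"{a}-{b}" for a, b in pairs if abs(means[a] - means[b]) > lsd]
print(round(lsd, 3), sig)
```

For this particular data set, LSD happens to flag the same four pairs as Bonferroni; the difference is that LSD makes no FWER guarantee as the family of tests grows.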

Summary of Multiple Comparison Adjustments

  Method         Family of Tests                 Critical Value to Keep FWER < α
  Fisher's LSD   a single pairwise comparison    t_{α/2, N−g}
  Dunnett        all comparisons with a control  dα(g−1, N−g)
  Tukey-Kramer   all pairwise comparisons        qα(g, N−g)/√2
  Bonferroni     varies                          t_{α/(2k), N−g}, where k = # of tests
  Scheffé        all contrasts                   √((g−1) F_{α, g−1, N−g})

Example — Beet Lice

Recall

  treatment    A      B      C      D
  ni           25     25     25     25        MSE = 47.84
  ȳi•        12.00  14.96  18.36  24.00

  SE(ȳi• − ȳk•) = √(MSE (1/ni + 1/nk)) = √(47.84 × 2/25) = 1.9563.

The critical values at α = 0.05 are

> alpha = 0.05
> g = 4
> r = g*(g-1)/2
> N = 100
> qt(1-alpha/2, df = N-g)                    # Fisher's LSD
[1] 1.984984
> qt(1-alpha/2/r, df = N-g)                  # Bonferroni
[1] 2.694028
> qtukey(1-alpha, g, df = N-g)/sqrt(2)       # Tukey's HSD
[1] 2.614607
> sqrt((g-1)*qf(1-alpha, df1=g-1, df2=N-g))  # Scheffe
[1] 2.84573

The half widths of the C.I.'s are (critical value) × SE, which are

  Procedure        LSD    Tukey  Bonferroni  Scheffe
  C.I. half width  3.883  5.115  5.270       5.567

  diff   LSD              Tukey            Bonferroni       Scheffe
B-C  -3.40  ( -7.28,  0.48)  ( -8.51,  1.71)  ( -8.67,  1.87)  ( -8.97,  2.17)
A-C  -6.36  (-10.24, -2.48)  (-11.47, -1.25)  (-11.63, -1.09)  (-11.93, -0.79)
D-C   5.64  (  1.76,  9.52)  (  0.53, 10.75)  (  0.37, 10.91)  (  0.07, 11.21)
A-B  -2.96  ( -6.84,  0.92)  ( -8.07,  2.15)  ( -8.23,  2.31)  ( -8.53,  2.61)
D-B   9.04  (  5.16, 12.92)  (  3.93, 14.15)  (  3.77, 14.31)  (  3.47, 14.61)
D-A  12.00  (  8.12, 15.88)  (  6.89, 17.11)  (  6.73, 17.27)  (  6.43, 17.57)

Which Procedures to Use?

• Use BONFERRONI when only interested in a small number of planned contrasts (or pairwise comparisons)

• Use DUNNETT when only interested in comparing all treatments with a control

• Use TUKEY when interested in all (or most) pairwise comparisons of means

• Use SCHEFFÉ when doing anything that could be considered data snooping, i.e., for any unplanned contrasts
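This advice can be made concrete by comparing critical values: for a small family of k planned pairwise comparisons, Bonferroni's t_{α/(2k), N−g} is smaller (hence more powerful) than Tukey's qα(g, N−g)/√2, but it overtakes Tukey once k reaches all C(4,2) = 6 pairs. A scipy-based sketch for the beet lice dimensions:

```python
import math
from scipy import stats

g, N, alpha = 4, 100, 0.05
# Tukey's critical value for all pairwise comparisons
tukey = stats.studentized_range.ppf(1 - alpha, g, N - g) / math.sqrt(2)
for k in (2, 3, 6):
    # Bonferroni critical value for a family of k two-sided tests
    bonf = stats.t.ppf(1 - alpha / (2 * k), N - g)
    print(f"k={k}: Bonferroni {bonf:.3f} vs Tukey {tukey:.3f}")
```

With k = 2 or 3 planned comparisons, Bonferroni's critical value is below Tukey's 2.615; at k = 6 it rises to 2.694 and Tukey wins.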

Significance Level vs. Power

[Diagram: the procedures ordered from least conservative / most powerful to most conservative / least powerful:
  LSD → Dunnett → Tukey → Bonferroni (for all pairwise comparisons) → Scheffé]

In the figure above, Bonferroni means the Bonferroni correction for all pairwise comparisons. For a smaller family of, say, k tests, one can divide α by k rather than by r = g(g−1)/2. The resulting C.I.'s or tests may have greater power than Tukey or Dunnett, while keeping FWER < α. Remember that to use Bonferroni, the contrasts should be pre-planned.

Multiple Comparisons in Balanced Block Designs

All the multiple comparison procedures apply to balanced block designs; just change the degrees of freedom from N−g to the d.f. of MSE.

  Method         Family of Tests                 Critical Value to Keep FWER < α
  Fisher's LSD   a single pairwise comparison    t_{α/2, df of MSE}
  Dunnett        all comparisons with a control  dα(g−1, df of MSE)
  Tukey-Kramer   all pairwise comparisons        qα(g, df of MSE)/√2
  Bonferroni     all pairwise comparisons        t_{α/(2r), df of MSE}, where r = g(g−1)/2
  Scheffé        all contrasts                   √((g−1) F_{α, g−1, df of MSE})

Recall Example 13.1 (Mealybugs on Cycads)

• Treatment: water (control), fungal spores, and horticultural oil
• 5 infested cycads; 3 branches are randomly chosen on each cycad, and 2 patches (3 cm × 3 cm) are marked on each branch
• The 3 branches on each cycad are randomly assigned to the 3 treatments
• Response: difference of the # of mealybugs in the patches before and 3 days after the treatments are applied
• As the patches are measurement units, we take the average of the two patches on each branch as the response

Table 13.1: Changes in mealybug counts on cycads after treatment. Treatments are water, Beauveria bassiana spores, and horticultural oil.

  Plant      1        2        3        4        5
  Water    -9  -6   18   5   10   9    9   0   -6  13
  Spores   -4   7   29  10    4  -1   -2   6   11  -1
  Oil       4  11   29  36   14  16   14  18    7  15

Example 13.1 (Mealybugs on Cycads)

  Treatment   Water   Spore   Oil
  ȳi•          4.3     5.9   16.4        MSE = 17.725, df of MSE = 8

The SE for a pairwise comparison is

  √(MSE (1/r + 1/r)) = √(17.725 × (1/5 + 1/5)) ≈ 2.663.

Tukey's critical value is 2.857:

> qtukey(0.95, 3, df = 8)/sqrt(2)
[1] 2.857444

Tukey's HSD controlling FWER at 0.05 is 2.857 × 2.663 ≈ 7.608.

We see that the spores treatment cannot be distinguished from the control (water), since their means did not differ by more than 7.608, but both can be distinguished from the oil treatment.

Example 13.1 (Mealybugs on Cycads)

> aov1 = aov(avechange ~ trt + as.factor(plant), data=cycad)
> TukeyHSD(aov1)
  Tukey multiple comparisons of means
    95% family-wise confidence level

$trt
            diff       lwr       upr     p adj
Spore-Water  1.6 -6.008532  9.208532 0.8235730
Oil-Water   12.1  4.491468 19.708532 0.0047478
Oil-Spore   10.5  2.891468 18.108532 0.0105848

$`as.factor(plant)`
          diff        lwr        upr     p adj
2-1  20.666667   8.790833 32.5425005 0.0021283
3-1   8.166667  -3.709167 20.0425005 0.2154812
4-1   7.000000  -4.875834 18.8758339 0.3302742
5-1   6.000000  -5.875834 17.8758339 0.4607553
3-2 -12.500000 -24.375834 -0.6241661 0.0390953
4-2 -13.666667 -25.542501 -1.7908328 0.0248443
5-2 -14.666667 -26.542501 -2.7908328 0.0169882
4-3  -1.166667 -13.042501 10.7091672 0.9965298
5-3  -2.166667 -14.042501  9.7091672 0.9657205
5-4  -1.000000 -12.875834 10.8758339 0.9980873

• Tukey's HSD at the 5% level for pairwise comparisons of the 3 treatments agrees with our computation
• Tukey's HSD for pairwise comparisons of the 5 plants is nonsense here

Tukey-Kramer for BIBD

Recall for BIBD, the estimate of αi1 − αi2 is

  α̂i1 − α̂i2 = (k/(λg)) (Qi1 − Qi2)

where Qi = yi• − (1/k) Σj Iij y•j, and Iij = 1 if treatment i appears in block j, or 0 otherwise.

• SE(α̂i1 − α̂i2) = √(MSE × 2k/(λg))

• t-statistic = (α̂i1 − α̂i2)/SE, with df = df of MSE

• Tukey-Kramer: reject H0: αi1 = αi2 if

  |t| > qα(g, df of MSE)/√2.

Recall Problem 14.3 — Exam Grading

[Table from Problem 14.3 of the textbook: the scores that each of 30 exams received from the 5 graders who graded it; raw data omitted here]

• g = 25 graders (treatments)
• b = 30 exams (blocks)
• Each exam was graded by 5 graders (size of block k = 5)
• Each grader graded 6 exams (number of replicates per treatment r = 6)
• Every pair of graders graded 1 exam in common (λ = 1)

Problem 14.3 — Exam Grading

How to identify inconsistent graders?

> aov1 = aov(score ~ exam + grader, data=pr14.3)
> tukey.grader = TukeyHSD(aov1, "grader"); tukey.grader
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = score ~ exam + grader, data = pr14.3)

$grader
               diff         lwr          upr        p adj
2-1    3.400000e+00  -2.4258844  9.225884361 8.736401e-01
3-1   -4.600000e+00 -10.4258844  1.225884361 3.471559e-01
4-1    6.933333e+00   1.1074490 12.759217694 4.692768e-03
5-1   -2.200000e+00  -8.0258844  3.625884361 9.992897e-01
6-1   -1.266667e+00  -7.0925510  4.559217694 1.000000e+00
7-1    2.033333e+00  -3.7925510  7.859217694 9.997962e-01
(... omitted ...)
25-23  1.200000e+00  -4.6258844  7.025884361 1.0000000
25-24  9.666667e-01  -4.8592177  6.792551027 1.0000000

There are C(25, 2) = 300 pairwise comparisons.

> tukey.grader = TukeyHSD(aov1)$grader
> padj = tukey.grader[,4]
> tukey.grader[padj < 0.05,]
          diff         lwr          upr        p adj
4-1   6.933333   1.1074490 12.759217694 4.692768e-03
3-2  -8.000000 -13.8258844 -2.174115639 3.306253e-04
4-3  11.533333   5.7074490 17.359217694 1.196388e-08
7-3   6.633333   0.8074490 12.459217694 9.319204e-03
11-3  7.100000   1.2741156 12.925884361 3.165681e-03
12-3  6.400000   0.5741156 12.225884361 1.554940e-02
13-3  5.933333   0.1074490 11.759217694 4.061621e-02
17-3  6.333333   0.5074490 12.159217694 1.793168e-02
20-3  6.800000   0.9741156 12.625884361 6.389338e-03
22-3  6.566667   0.7407823 12.392551027 1.080881e-02
25-3  6.400000   0.5741156 12.225884361 1.554940e-02
5-4  -9.133333 -14.9592177 -3.307448973 1.489373e-05
6-4  -8.200000 -14.0258844 -2.374115639 1.948518e-04
8-4  -7.533333 -13.3592177 -1.707448973 1.095219e-03
9-4  -7.166667 -12.9925510 -1.340782306 2.698100e-03
10-4 -5.833333 -11.6592177 -0.007448973 4.929305e-02
14-4 -7.566667 -13.3925510 -1.740782306 1.007216e-03
15-4 -7.566667 -13.3925510 -1.740782306 1.007216e-03
16-4 -8.400000 -14.2258844 -2.574115639 1.138680e-04
18-4 -6.066667 -11.8925510 -0.240782306 3.115872e-02
19-4 -6.566667 -12.3925510 -0.740782306 1.080881e-02
21-4 -7.266667 -13.0925510 -1.440782306 2.117805e-03
23-4 -6.333333 -12.1592177 -0.507448973 1.793168e-02
24-4 -6.100000 -11.9258844 -0.274115639 2.912592e-02

Only Graders #3 and #4 are significantly inconsistent with the others, after Tukey's adjustment.