252y0772 11/26/07 (Page layout view!)

ECO252 QBA2 THIRD EXAM, November 29, 2007, Version 2
Name: __Key__  Student number: ______  Class day and hour: ______

I. (8 points) Do all the following (2 points each unless noted otherwise). Make Diagrams! Show your work! All probabilities must be between zero and (positive) 1.

x ~ N(27, 14)

1. $P(46 \le x \le 77) = P\left(\frac{46-27}{14} \le z \le \frac{77-27}{14}\right) = P(1.36 \le z \le 3.57) = P(0 \le z \le 3.57) - P(0 \le z \le 1.36) = .4998 - .4131 = .0867$
For z make a diagram. Draw a Normal curve with a mean at 0. Indicate the mean by a vertical line! Shade the area between 1.36 and 3.57. Because this is on one side of zero we must subtract the area between zero and 1.36 from the larger area between zero and 3.57. If you wish, make a completely separate diagram for x. Draw a Normal curve with a mean at 27. Indicate the mean by a vertical line! Shade the area between 46 and 77. This area is on one side of the mean (27) so we subtract to get our answer.

2. $P(21 \le x \le 39) = P\left(\frac{21-27}{14} \le z \le \frac{39-27}{14}\right) = P(-0.43 \le z \le 0.86) = P(-0.43 \le z \le 0) + P(0 \le z \le 0.86) = .1664 + .3051 = .4715$
For z make a diagram. Draw a Normal curve with a mean at 0. Indicate the mean by a vertical line! Shade the entire area between -0.43 and 0.86. Because this is on both sides of zero we must add the area between -0.43 and zero to the area between zero and 0.86. If you wish, make a completely separate diagram for x. Draw a Normal curve with a mean at 27. Indicate the mean by a vertical line! Shade the entire area between 21 and 39. This area is on both sides of the mean (27) so we add to get our answer.

3. $P(x \ge 0) = P\left(z \ge \frac{0-27}{14}\right) = P(z \ge -1.93) = P(-1.93 \le z \le 0) + P(z \ge 0) = .4732 + .5 = .9732$
For z make a diagram. Draw a Normal curve with a mean at 0. Indicate the mean by a vertical line! Shade the entire area above -1.93. Because this is on both sides of zero we must add the area between -1.93 and zero to the area above zero. If you wish, make a completely separate diagram for x. Draw a Normal curve with a mean at 27. Indicate the mean by a vertical line! Shade the entire area above zero, remembering that zero is below the mean. This area is on both sides of the mean (27) so we add to get our answer.

4. $x_{.08}$ (Do not try to use the t table to get this.)
For z make a diagram. Draw a Normal curve with a mean at 0. $z_{.08}$ is the value of z with 8% of the distribution above it. Since 100 - 8 = 92, it is also the 92nd percentile. Since 50% of the standardized Normal distribution is below zero, your diagram should show that the probability between $z_{.08}$ and zero is 92% - 50% = 42% or $P(0 \le z \le z_{.08}) = .4200$. The closest we can come to this is $P(0 \le z \le 1.41) = .4207$. (1.40 is also pretty close.) So $z_{.08} = 1.41$ (or something slightly smaller). To get from $z_{.08}$ to $x_{.08}$, use the formula $x = \mu + z\sigma$, which is the opposite of $z = \frac{x-\mu}{\sigma}$. So $x_{.08} = 27 + 1.41(14) = 46.74$. If you wish, make a completely separate diagram for x. Draw a Normal curve with a mean at 27. Show that 50% of the distribution is below the mean (27). If 8% of the distribution is above $x_{.08}$, it must be above the mean and have 42% of the distribution between it and the mean.
Check: $P(x \ge 46.74) = P\left(z \ge \frac{46.74-27}{14}\right) = P(z \ge 1.41) = P(z \ge 0) - P(0 \le z \le 1.41) = .5 - .4207 = .0793 \approx .08$
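The four table lookups in Section I can be checked by computer. The following is only a sketch, not part of the exam: it assumes Python with the standard library's `statistics.NormalDist`, and the variable names are made up for illustration. Because it does not round z to two decimal places, its answers differ from the table answers in the third decimal place.

```python
from statistics import NormalDist

# x ~ N(27, 14): mean 27 and standard deviation 14, as in Section I.
x = NormalDist(mu=27, sigma=14)

p1 = x.cdf(77) - x.cdf(46)    # P(46 <= x <= 77), area on one side of the mean
p2 = x.cdf(39) - x.cdf(21)    # P(21 <= x <= 39), area on both sides of the mean
p3 = 1 - x.cdf(0)             # P(x >= 0)
x_08 = x.inv_cdf(1 - 0.08)    # the value with 8% of the distribution above it

print(round(p1, 4), round(p2, 4), round(p3, 4), round(x_08, 2))
```

The small differences from .0867, .4715, .9732 and 46.74 come only from rounding z before using the printed table.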


II. (22+ points) Do all the following (2 points each unless noted otherwise). Do not answer a question ‘yes’ or ‘no’ without giving reasons. Show your work when appropriate. Use a 5% significance level except where indicated otherwise. Note that this is extremely long and that no one will do all the problems, so look them over!

1. Turn in your computer problems 2 and 3 marked as requested in the Take-home. (5 points, 2 point penalty for not doing.)

2. In an ordinary 1-way ANOVA, if the computed F statistic exceeds the value from the F table at the given significance level, we can
   a. Reject the null hypothesis because the difference between the means is not significant.
   b.* Reject the null hypothesis because there is evidence of a significant difference between some of the means.
   c. Not reject the null hypothesis because the difference between the means is not significant.
   d. Not reject the null hypothesis because the difference between the means is significant.
   e. Not reject the null hypothesis because the difference between the variances is not significant.
   f. Not reject the null hypothesis because the difference between the variances is significant.
   g. None of the above. [7]

3. After an analysis of variance, you would use the Tukey-Kramer procedure or similar confidence intervals to check
   a. For Normality
   b. For equality of variances
   c. For independence of error terms
   d.* For pairwise differences in means
   e. For all of the above
   f. For none of the above

4. If an ordinary one-way ANOVA has 25 columns, 17 rows and 17(25) = 425 observations, the degrees of freedom for the F test are
   a. 400 and 24
   b. 408 and 16
   c.* 24 and 400
   d. 16 and 408
   e. 400 and 424
   f. 408 and 424
   g. 424 and 400
   h. 424 and 408
   i. 16 and 24
   j. None of the above.
The correct answer is ______.
Explanation: This is a one-way ANOVA. The total number of observations is n = 425 and the number of columns is m = 25. This means there are 425 - 1 = 424 total degrees of freedom and that between the columns there are 25 - 1 = 24 degrees of freedom. This leaves 424 - 24 = 400 degrees of freedom for the error (within) term. Numbers are filled in below.

Source    SS    DF             MS                   F              F (table)
Between   SSB   m - 1 = 24     MSB = SSB/(m - 1)    F = MSB/MSW    $F_{.01}^{(24,400)} = 1.84$
Within    SSW   n - m = 400    MSW = SSW/(n - m)
Total     SST   n - 1 = 424
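The degrees-of-freedom bookkeeping in the explanation can be written as a short function. This is an illustrative sketch only; the function name `one_way_anova_df` is hypothetical and is not taken from the course materials.

```python
def one_way_anova_df(n_obs, n_columns):
    """Return (between, within, total) degrees of freedom for a one-way ANOVA."""
    total = n_obs - 1
    between = n_columns - 1
    within = total - between   # the same thing as n_obs - n_columns
    return between, within, total

# 25 columns of 17 observations each, as in question 4.
df_between, df_within, df_total = one_way_anova_df(17 * 25, 25)
print(df_between, df_within, df_total)
```

The F test uses the between degrees of freedom first and the within degrees of freedom second.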


5. Assuming that your answer to 4 is correct and that the significance level is 1%, the correct value of F from the table is __1.84___. (This may have to be approximate. If so, what did you use?) (1) [12]
Note: I will check your answer against what you said in the previous question. The answer above is wrong if you did not say something close to $F_{.01}^{(24,400)}$.

Exhibit 1
A manager believes that the number of sales that an employee makes is related to the number of years worked and their score on an aptitude test. He runs the data below on Minitab and gets the following.

Employee  Sales  Years  Score
1           110     11     70
2           100      4    100
3            90      9     90
4            80      6     40
5            70      6     80
6            60      8     50
7            50      2     40
8            40      2     10

MTB > regress c1 2 c2 c3

Regression Analysis: Sales versus Years, Score

The regression equation is
Sales = 28.2 + 3.10 Years + 0.470 Score

Predictor    Coef  SE Coef     T      P
Constant    28.19    13.87  2.03  0.098
Years       3.103    1.984  1.56  0.179
Score      0.4698   0.2133  2.20  0.079

S = 15.1088   R-Sq = 72.8%   R-Sq(adj) = 62.0%

Analysis of Variance
Source          DF      SS      MS     F      P
Regression       2  3058.6  1529.3  6.70  0.038
Residual Error   5  1141.4   228.3
Total            7  4200.0

Source  DF  Seq SS
Years    1  1951.4
Score    1  1107.3

The sum of the sales column is 600 and the sum of the squared numbers in the sales column is not needed. The sum of the 'years' column is 48 and the sum of the squared numbers in the years column is 362. The sum of the score column is 480 and the sum of the squared numbers in the score column is 35200. If Sales is the dependent variable and years and score are the independent variables, we have found that $\sum x_1 y = 3980$ and $\sum x_1 x_2 = 3200$. The sum $\sum x_2 y$ has not been computed.

6. In the multiple regression, what coefficients are significant at the 10% significance level? (2)
Solution: Of the two slopes, only 0.4698, the coefficient of ‘score’, has a p-value (0.079) below .10. The constant, with a p-value of 0.098, also falls just below .10.

7. In the multiple regression, what coefficients are significant at the 5% significance level? (1) [15] Solution: None

8. Assuming that the coefficients in the multiple regression are correct, how many sales would we predict for someone with 9 years of experience and a score of 90? (1)
Sales = 28.19 + 3.103 Years + 0.4698 Score = 28.19 + 3.103(9) + 0.4698(90) = 28.19 + 27.93 + 42.28 = 98.40

9. Using the information in the multiple regression printout, make your result in 8) into a rough prediction interval. (2)
Solution: The outline says that an approximate prediction interval is $Y_0 = \hat{Y}_0 \pm t\,s_e$. Remember $df = n - k - 1 = 8 - 2 - 1 = 5$. In the printout $s_e = S = 15.1088 = \sqrt{MSE} = \sqrt{228.3}$. So $Y_0 = 98.40 \pm t_{.025}^{(5)}(15.1088) = 98.40 \pm 2.571(15.1088) = 98.40 \pm 38.84$.


10. Using the information in the printout, what is the value of R-squared for a regression of ‘sales’ against ‘years’ alone? (2) [20]
Solution: Looking at the sequential sum of squares, the regression sum of squares is 1951.4 for ‘Years’ alone. The total sum of squares is 4200.0, so we have $R^2 = \frac{1951.4}{4200.0} = .4646$.

11. Do a simple regression of ‘sales’ against ‘score’ alone. Before you do something ridiculous see 252blunders!
a) Compute the sum $\sum xy$ that you will need for this regression. Show your work! (2) Don’t compute stuff that has already been done for you!
Solution: The only column that you should have computed is the x2y column below.

Row  Sales  Years  Score    Ysq  x1sq   x2sq   x1y    x2y  x1x2
1      110     11     70  12100   121   4900  1210   7700   770
2      100      4    100  10000    16  10000   400  10000   400
3       90      9     90   8100    81   8100   810   8100   810
4       80      6     40   6400    36   1600   480   3200   240
5       70      6     80   4900    36   6400   420   5600   480
6       60      8     50   3600    64   2500   480   3000   400
7       50      2     40   2500     4   1600   100   2000    80
8       40      2     10   1600     4    100    80    400    20
Sum    600     48    480  49200   362  35200  3980  40000  3200
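The column sums in the table above can be reproduced from the Exhibit 1 data. The sketch below is an illustration only, assuming Python; the variable names are invented here and the exam itself expects hand computation.

```python
# Exhibit 1 data: sales (y), years (x1) and score (x2) for the 8 employees.
sales = [110, 100, 90, 80, 70, 60, 50, 40]
years = [11, 4, 9, 6, 6, 8, 2, 2]
score = [70, 100, 90, 40, 80, 50, 40, 10]

sum_y = sum(sales)
sum_x2 = sum(score)
sum_x2_sq = sum(x * x for x in score)
sum_y_sq = sum(y * y for y in sales)
sum_x1y = sum(x * y for x, y in zip(years, sales))
sum_x2y = sum(x * y for x, y in zip(score, sales))   # the only sum not given
sum_x1x2 = sum(a * b for a, b in zip(years, score))

print(sum_y, sum_x2, sum_x2_sq, sum_y_sq, sum_x1y, sum_x2y, sum_x1x2)
```

Only `sum_x2y` is new information; the other sums simply confirm the figures quoted with Exhibit 1.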

b) It says that you do not need to know the sum of squares in the sales column. You do however need the spare part $SS_y = \sum Y^2 - n\bar{Y}^2$. Without doing any computing, tell what its value is. (1)
Solution: The ANOVA in the computer output says that the total sum of squares is 4200.0. Of course, if you like to waste time, $SS_y = \sum Y^2 - n\bar{Y}^2 = 49200 - 8\left(\frac{600}{8}\right)^2 = 4200$. ($\bar{Y} = 75$)


c) Compute the coefficients of the equation $\hat{Y} = b_0 + b_1 x$ to predict the value of ‘sales’ on the basis of ‘score.’ (4) [27]
Solution: First copy $n = 8$, $\sum X = 480$, $\sum Y = 600$, (you computed) $\sum XY = 40000$ and $\sum X^2 = 35200$. $\sum Y^2$ is not needed. (It’s 49200.)
Then compute means: $\bar{X} = \frac{\sum X}{n} = \frac{480}{8} = 60$ and $\bar{Y} = \frac{\sum Y}{n} = \frac{600}{8} = 75$.
The ‘Spare Parts’ are as follows:
$SS_x = \sum X^2 - n\bar{X}^2 = 35200 - 8(60)^2 = 6400$
$SS_y = \sum Y^2 - n\bar{Y}^2 = 4200 = SST$ (Total Sum of Squares), which you already found.
$S_{xy} = \sum XY - n\bar{X}\bar{Y} = 40000 - 8(60)(75) = 4000$
So $b_1 = \frac{S_{xy}}{SS_x} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{4000}{6400} = 0.625$ and $b_0 = \bar{Y} - b_1\bar{X} = 75 - 0.625(60) = 37.5$, which means $\hat{Y} = 37.5 + 0.625X$ or $Y = 37.5 + 0.625X + e$.

d) Compute $R^2$. (3)
Solution: $SSR = b_1 S_{xy} = 0.625(4000) = 2500$. We can say $R^2 = \frac{SSR}{SST} = \frac{2500}{4200} = .59524$ or $R^2 = \frac{b_1 S_{xy}}{SS_y} = \frac{0.625(4000)}{4200} = .59524$ or $R^2 = \frac{S_{xy}^2}{SS_x SS_y} = \frac{(4000)^2}{6400(4200)} = .59524$.
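The spare-parts arithmetic in c) and d) can be checked with a few lines of code. This is a sketch under the assumption that the sums quoted in the solution are correct; the variable names are illustrative and do not come from the outline.

```python
# Sums for the regression of sales (Y) on score (X), copied from the solution.
n = 8
sum_x, sum_y = 480, 600
sum_xy, sum_x_sq, sum_y_sq = 40000, 35200, 49200

x_bar = sum_x / n
y_bar = sum_y / n

# The 'spare parts'.
ss_x = sum_x_sq - n * x_bar ** 2
ss_y = sum_y_sq - n * y_bar ** 2        # this is also SST
s_xy = sum_xy - n * x_bar * y_bar

b1 = s_xy / ss_x                        # slope
b0 = y_bar - b1 * x_bar                 # intercept
ssr = b1 * s_xy                         # regression sum of squares
r_squared = ssr / ss_y

print(b0, b1, ssr, round(r_squared, 5))
```

All three formulas for R-squared in d) reduce to the same ratio, so only one is coded here.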


e) Is the slope of the simple regression significant at the 5% level? Do not answer this question without appropriate calculations! (4)
Solution: We can compute $SSE = SST - SSR = 4200 - 2500 = 1700$. Then $s_e^2 = \frac{SSE}{n-2} = \frac{1700}{6} = 283.3333$ or $s_e^2 = \frac{SS_y - b_1 S_{xy}}{n-2} = \frac{4200 - 0.625(4000)}{6} = 283.3333$, so $s_e = \sqrt{283.3333} = 16.83251$. So $s_{b_1}^2 = s_e^2\left(\frac{1}{SS_x}\right) = \frac{283.3333}{6400} = .04427$ and $s_{b_1} = \sqrt{.04427} = 0.2104$.

The outline says that to test $H_0: \beta_1 = 0$ against $H_1: \beta_1 \ne 0$ we use $t = \frac{b_1 - 0}{s_{b_1}}$, and if the null hypothesis is false in that case we say that $\beta_1$ is significant. Our ‘do not reject’ zone is between $\pm t_{.025}^{(6)} = \pm 2.447$ if $\alpha = .05$. Our calculated $t = \frac{0.625 - 0}{.2104} = 2.971$ falls outside this zone, so we reject the null hypothesis and conclude that the coefficient is significant.
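The significance test on the slope can be sketched as follows. This is an illustration only: the spare parts are copied from the solution, and the critical value 2.447 is typed in from the t table (6 degrees of freedom, .025 in each tail) rather than computed, because the Python standard library has no t distribution.

```python
import math

# Spare parts for the regression of sales on score, from the solution above.
n = 8
ss_x, ss_y, s_xy = 6400.0, 4200.0, 4000.0

b1 = s_xy / ss_x
sse = ss_y - b1 * s_xy             # error sum of squares, SST - SSR
s_e_sq = sse / (n - 2)             # variance of the residuals
s_b1 = math.sqrt(s_e_sq / ss_x)    # standard error of the slope
t_stat = (b1 - 0) / s_b1

T_CRIT = 2.447                     # table value, assumed, not computed
significant = abs(t_stat) > T_CRIT

print(round(s_b1, 4), round(t_stat, 3), significant)
```

The same pattern works for any hypothesized slope if the 0 in `t_stat` is replaced by the null value.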

f) Predict the number of sales an individual with a score of 90 will make and make your estimate into an appropriate 95% interval. (4)
Solution: We found $\hat{Y} = 37.5 + 0.625X = 37.5 + 0.625(90) = 93.75$. The outline says that the prediction interval is $Y_0 = \hat{Y}_0 \pm t\,s_Y$, where $s_Y^2 = s_e^2\left[\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{SS_x} + 1\right]$, $\bar{X} = 60$, $SS_x = 6400$ and $s_e^2 = 283.3333$. So $s_Y^2 = 283.3333\left[\frac{1}{8} + \frac{(90 - 60)^2}{6400} + 1\right] = 283.3333(0.125 + 0.140625 + 1) = 283.3333(1.265625) = 358.5937$. We will use $s_Y = \sqrt{358.5937} = 18.9366$ and $t_{.025}^{(6)} = 2.447$. So $Y_0 = \hat{Y}_0 \pm t\,s_Y = 93.75 \pm 2.447(18.9366) = 93.75 \pm 46.34$ or 47.41 to 140.09.
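A minimal sketch of the prediction interval follows, assuming the regression results found earlier in this problem. The helper name `prediction_interval` is hypothetical, and the t value is again copied from the table rather than computed.

```python
import math

def prediction_interval(x0, b0, b1, n, x_bar, ss_x, s_e_sq, t_crit):
    """Point prediction and prediction interval for one new observation at x0."""
    y_hat = b0 + b1 * x0
    # Variance for an individual prediction: note the '+ 1' term.
    s_y_sq = s_e_sq * (1 / n + (x0 - x_bar) ** 2 / ss_x + 1)
    half_width = t_crit * math.sqrt(s_y_sq)
    return y_hat, y_hat - half_width, y_hat + half_width

# Values from the regression of sales on score; 2.447 is the table t (6 df).
y_hat, lower, upper = prediction_interval(
    x0=90, b0=37.5, b1=0.625, n=8, x_bar=60,
    ss_x=6400, s_e_sq=1700 / 6, t_crit=2.447,
)
print(round(y_hat, 2), round(lower, 2), round(upper, 2))
```

Dropping the `+ 1` term would give a confidence interval for the mean of Y at that score instead of a prediction interval for one individual.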


g) Do an analysis of variance using your SST, SSE and SSR for this equation or using 1, $R^2$ and $1 - R^2$. What have you already done that makes this table redundant? If you don’t know what redundant means, ask! (3) [43]
Solution: We actually have almost all this done. We have already found $SS_y = 4200 = SST$, $SSR = b_1 S_{xy} = 0.625(4000) = 2500$ and $SSE = SST - SSR = 4200 - 2500 = 1700$. So our ANOVA table will be as below.

Source        SS   DF  MS         F      F (table)
Regression  2500    1  2500       8.824  $F_{.05}^{(1,6)} = 5.99$
Error       1700    6  283.3333
Total       4200    7

If we recall $R^2 = .59524$ for this regression, we can rewrite the table as below.

Source      $R^2$    DF  ‘MS’     F      F (table)
Regression  0.59524   1  0.59524  8.824  $F_{.05}^{(1,6)} = 5.99$
Error       0.40476   6  0.06746
Total       1.00000   7

Just for reassurance, here is the Minitab output.

Regression Analysis: Sales versus Score

The regression equation is
Sales = 37.5 + 0.625 Score

Predictor    Coef  SE Coef     T      P
Constant    37.50    13.96  2.69  0.036
Score      0.6250   0.2104  2.97  0.025

S = 16.8325   R-Sq = 59.5%   R-Sq(adj) = 52.8%

Analysis of Variance
Source          DF      SS      MS     F      P
Regression       1  2500.0  2500.0  8.82  0.025
Residual Error   6  1700.0   283.3
Total            7  4200.0


This is redundant because we have already shown that the coefficient of ‘Score’ is significant. Because there is only one independent variable, this shows the same thing.
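The redundancy can also be seen numerically: with a single independent variable, the ANOVA F statistic is the square of the t statistic for the slope. The sketch below is an illustration with assumed variable names, using the spare parts from this problem.

```python
import math

# Spare parts for the regression of sales on score.
n = 8
ss_x, ss_y, s_xy = 6400.0, 4200.0, 4000.0

b1 = s_xy / ss_x
ssr = b1 * s_xy                  # regression sum of squares, 1 df
sse = ss_y - ssr                 # error sum of squares, n - 2 df
msr = ssr / 1
mse = sse / (n - 2)

f_stat = msr / mse                        # ANOVA F with (1, n - 2) df
t_stat = b1 / math.sqrt(mse / ss_x)       # t statistic for the slope

print(round(f_stat, 3), round(t_stat ** 2, 3))
```

Both printed numbers agree, which is why the F test here adds nothing to the t test already done.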

h) Using the information on regression sums of squares or $R^2$ and $1 - R^2$ in the ANOVA that you just did and from the multiple regression, do an F test to see if adding ‘years’ to the regression of ‘sales’ against ‘score’ is worthwhile. Do not waste our time by repeating stuff that has already been done. (3) [46]
Solution: In view of the original printout, we can rewrite our ANOVA tables above for the multiple regression. So our ANOVA table will be as below.

Source SS DF MS F F Regression 3058.6 2 1529.3 6.70 Error 1141.4 5 226.3 Total 4200 7 If we recall R 2  .7280 for this regression, we can rewrite the table as below. 2 Source R DF ‘MS’ F F Regression 0.7280 2 0.3640 6.691 Error 0.2720 5 0.0544 Total 1.0000 7 For the regression against ‘Score’ alone, we had R 2  .59524 and SSR  2500 . So that if we itemize the regressions above, we get the tables below.

Source SS DF MS F F Regression 3058.6 2 Score 2500.0 1 F 1,5  6.61 Years 558.6 1 558.6 2.468 .05 Error 1141.4 5 226.3 Total 4200 7 If we use R 2 instead for this regression, we can rewrite the table as below. 2 Source R DF ‘MS’ F F Regression 0.7280 2 Score 0.5952 1 F 1,5  6.61 Years 0.1328 1 0.1328 2.441 .05 Error 0.2720 5 0.0544 Total 1.0000 7


In spite of the gigantic rounding error in the table using $R^2$, the results are the same as in the t-test on the coefficient of ‘years’ in the multiple regression. The calculated F is below the table F, so adding ‘years’ does not significantly improve the results.

Exhibit 2 (Groebner)
A product is being produced on 3 different lines using 3 different layouts for the lines. A sample of 36 observations is taken on various days over a period of four weeks so that there are 12 observations for the daily output for each line, evenly divided between the three possible layouts. Assume $\alpha = .05$.

MTB > Twoway c5 c2 c3;
SUBC> Means c2 c3.

Two-way ANOVA: output 2 versus line, layout

Source       DF       SS       MS      F      P
line          2    133.4     66.7   0.16  0.849
layout        2  29257.1  14628.5  36.06  0.000
Interaction   _   2119.1    529.8   1.31  0.293
Error         _   ______   ______
Total        35  42461.6

S = 20.14   R-Sq = 74.21%   R-Sq(adj) = 66.56%

Individual 95% CIs For Mean Based on Pooled StDev

line       Mean
1       131.833
2       127.750
3       127.750

layout     Mean
1       116.083
2       168.667
3       102.583

(Minitab's character plots of the individual 95% confidence intervals for the line and layout means are not reproduced here.)

12. Fill in the missing degrees of freedom, the missing sum of squares and the missing mean square. (2)
Solution: We can find the interaction degrees of freedom by multiplying the degrees of freedom for the factors that interact: 2(2) = 4. The error degrees of freedom are whatever is needed to make the DF column add up: 35 - 2 - 2 - 4 = 27. The error sum of squares is whatever makes the SS column add up: 42461.6 - 133.4 - 29257.1 - 2119.1 = 10952.0. The mean squares are found by dividing the sums of squares by their degrees of freedom. The corrected table reads as below.

Two-way ANOVA: output 2 versus line, layout

Source       DF       SS       MS      F      P
line          2    133.4     66.7   0.16  0.849
layout        2  29257.1  14628.5  36.06  0.000
Interaction   4   2119.1    529.8   1.31  0.293
Error        27  10952.0  405.630
Total        35  42461.6

S = 20.14   R-Sq = 74.21%   R-Sq(adj) = 66.56%
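Filling in the blanks is pure bookkeeping, so it is easy to check by computer. The sketch below is illustrative only; the variable names are assumptions and the known entries are copied from the Minitab printout.

```python
# Known entries of the two-way ANOVA table from the printout.
df_line, df_layout, df_total = 2, 2, 35
ss_line, ss_layout, ss_interaction, ss_total = 133.4, 29257.1, 2119.1, 42461.6

# Interaction df is the product of the df of the interacting factors.
df_interaction = df_line * df_layout
# Error df and SS are whatever makes the columns add up to the totals.
df_error = df_total - df_line - df_layout - df_interaction
ss_error = ss_total - ss_line - ss_layout - ss_interaction
# A mean square is a sum of squares divided by its degrees of freedom.
ms_error = ss_error / df_error
f_interaction = (ss_interaction / df_interaction) / ms_error

print(df_interaction, df_error, round(ss_error, 1),
      round(ms_error, 2), round(f_interaction, 2))
```

The last number reproduces the interaction F that Minitab printed, which is a useful check that the blanks were filled in correctly.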

13. Is there significant interaction between ‘line’ and ‘layout’? Don’t answer unless you can tell me what the evidence is. (2)
Solution: We can look up $F_{.05}^{(4,27)} = 2.73$ if we like to work. It is larger than the computed F of 1.31. Or we can simply note that since the p-value of 0.293 is well above any significance level that we are likely to use, we cannot reject the null hypothesis of no interaction.

14. Is the difference between layouts significant? Why?(1)


Solution: We can look up $F_{.05}^{(2,27)} = 3.35$ if we like to work. It is smaller than the computed F of 36.06. Or we can simply note that since the p-value of zero is well below any significance level that we are likely to use, we can reject the null hypothesis that the layout means are equal.

15. Do a confidence interval of your choice for the difference between layout 1 and layout 3. Tell what kind of interval you are using, what its characteristics are and whether it shows a significant difference. (4) [55]
Solution: The outline gives us a choice. There are 3 rows, 3 columns and 4 observations per cell, so $R = 3$, $C = 3$, $P = 4$, $\bar{x}_{1..} = 116.083$, $\bar{x}_{3..} = 102.583$ and $RC(P-1) = 3(3)(3) = 27$. The error (within) mean square is $MSW = 405.630$. $\bar{x}_{1..} - \bar{x}_{3..} = 116.083 - 102.583 = 13.500$ and $\sqrt{\frac{2MSW}{PC}} = \sqrt{\frac{2(405.630)}{12}} = \sqrt{67.6050} = 8.2222$.

i. A Single Confidence Interval
If we desire a single interval, we use the formula for a Bonferroni Confidence Interval below with $m = 1$. For rows this would be
$\mu_1 - \mu_3 = (\bar{x}_1 - \bar{x}_3) \pm t_{\alpha/2}^{(RC(P-1))}\sqrt{\frac{2MSW}{PC}} = 13.500 \pm t_{.025}^{(27)}(8.2222) = 13.500 \pm 2.052(8.2222) = 13.500 \pm 16.872$

ii. Scheffé Confidence Interval
If we desire intervals that will simultaneously be valid for a given confidence level for all possible intervals between means, for row means we use
$\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm \sqrt{(R-1)F_{\alpha}^{(R-1,\,RC(P-1))}}\sqrt{\frac{2MSW}{PC}}$, which here gives $13.500 \pm \sqrt{2F_{.05}^{(2,27)}}\,(8.2222) = 13.500 \pm \sqrt{2(3.35)}\,(8.2222) = 13.500 \pm 21.283$

iii. Bonferroni Confidence Interval
For row means, we use $\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/(2m)}^{(RC(P-1))}\sqrt{\frac{2MSW}{PC}}$, but this is usually impractical.

iv. Tukey Confidence Interval
This is of similar meaning to the Scheffé. For row means, we use
$\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm q_{\alpha}^{(R,\,RC(P-1))}\sqrt{\frac{MSW}{PC}}$, which here gives $13.500 \pm q_{.05}^{(3,27)}\sqrt{\frac{405.630}{12}} = 13.500 \pm 3.51(5.8140) = 13.500 \pm 20.407$
Note that this formula uses $\sqrt{MSW/PC} = 5.8140$, not the $\sqrt{2MSW/PC} = 8.2222$ of the other intervals. As it should be, the Tukey interval is somewhat smaller than the Scheffé. Part of the Tukey table appears below.

m           2     3     4     5     6     7     8     9    10
df = 24
  0.05   2.92  3.53  3.90  4.17  4.37  4.54  4.68  4.81  4.92
  0.01   3.96  4.55  4.91  5.17  5.37  5.54  5.69  5.81  5.92
df = 30
  0.05   2.89  3.49  3.85  4.10  4.30  4.46  4.60  4.63  4.73
  0.01   3.89  4.45  4.80  5.05  5.24  5.40  5.54  5.65  5.76

It looks to me as if $q_{.05}^{(3,27)}$ ought to be halfway between $q_{.05}^{(3,24)} = 3.53$ and $q_{.05}^{(3,30)} = 3.49$, so $q_{.05}^{(3,27)} = 3.51$.
Note that, since all of these intervals include zero, none of these differences is significant.
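The single and Scheffé intervals for the difference between layouts 1 and 3 can be checked with the sketch below. It is an illustration only: the table values 2.052 and 3.35 are typed in from the t and F tables as quoted in the solution, and the variable names are assumptions.

```python
import math

# Design and results from Exhibit 2.
R, C, P = 3, 3, 4                  # rows, columns, observations per cell
msw = 405.630                      # error (within) mean square, 27 df
diff = 116.083 - 102.583           # layout 1 mean minus layout 3 mean

std_err = math.sqrt(2 * msw / (P * C))     # 12 observations behind each mean

T_025_27 = 2.052                   # table t, 27 df, assumed
F_05_2_27 = 3.35                   # table F(.05; 2, 27), assumed

half_single = T_025_27 * std_err
half_scheffe = math.sqrt((R - 1) * F_05_2_27) * std_err

for name, half in [("single", half_single), ("Scheffe", half_scheffe)]:
    covers_zero = diff - half < 0 < diff + half
    print(name, round(diff, 3), "+/-", round(half, 3), "includes zero:", covers_zero)
```

An interval that includes zero means the difference between the two layout means is not significant by that method.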


16. (Groebner) An industrial firm analyses the amount of breakage (in dollar cost) that occurs using 3 different shipping methods and four products. There is a strong likelihood that the data does not come from the Normal distribution. The purpose of the test is to see if the three shipping methods differ in breakage, and the analysis is blocked by product.

            Rail  Plane  Truck
Product 1   7960   7853   8818
Product 2   8399   7764   9432
Product 3   9429   9196   9560
Product 4   6022   5821   5676

The most appropriate method for doing this test is:
   a)* The Friedman Test
   b) The Kruskal-Wallis Test
   c) One-way ANOVA
   d) Two-way ANOVA
   e) The sign test
   f) Another test (Name it!) [57]

17. Assume that your decision is correct in 16. What is your null hypothesis or hypotheses? Be specific! Are you talking about rows or columns or both? Are you comparing means, medians, proportions or variances?
Solution: The null hypothesis is that the medians of the columns are equal.

18. OK. Let’s see you do the test. (4) [63]
Solution: Rank the data within each product (row) and sum the ranks in each column.

            Rail  Plane  Truck
Product 1      2      1      3
Product 2      2      1      3
Product 3      2      1      3
Product 4      3      2      1
Sum            9      5     10

Now compute the Friedman statistic
$\chi_F^2 = \left[\frac{12}{rc(c+1)}\sum_i SR_i^2\right] - 3r(c+1) = \left[\frac{12}{4(3)(4)}\left(9^2 + 5^2 + 10^2\right)\right] - 3(4)(4) = \frac{1}{4}(206) - 48 = 3.5$
Check: $\sum SR_i = \frac{rc(c+1)}{2} = \frac{4(3)(4)}{2} = 24$ and 9 + 5 + 10 = 24.
If you check Table 8 (excerpt below) for $c = 3$ and $r = 4$, you find that the p-value is .273. Since this is above our 5% significance level, we cannot reject the null hypothesis of equal medians.

c = 3, r = 4
$\chi_F^2$   p-value
0.000     1.000
0.500      .931
1.500      .653
2.000      .431
3.500      .273
4.500      .125
6.000      .069
6.500      .042
8.000      .005

A Minitab verification follows.

MTB > Friedman c15 c2 c3.

Friedman Test: breakage 1 versus mode blocked by product

S = 6.50   DF = 2   P = 0.039

                      Sum of
mode   N  Est Median   Ranks
plane  4      8025.3     4.0
rail   4      8226.7     9.0
truck  4      8705.5    11.0

Grand median = 8319.2
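The hand computation of the Friedman statistic can be reproduced from the breakage table in question 16. The sketch below is illustrative, not a library routine: the helper `row_ranks` is hypothetical and assumes there are no ties within a row, which is true for this data.

```python
# Breakage data from question 16: rows are products, columns are Rail, Plane, Truck.
breakage = [
    [7960, 7853, 8818],
    [8399, 7764, 9432],
    [9429, 9196, 9560],
    [6022, 5821, 5676],
]

def row_ranks(row):
    """Rank the values in one row from 1 (smallest) upward; assumes no ties."""
    ordered = sorted(row)
    return [ordered.index(value) + 1 for value in row]

ranks = [row_ranks(row) for row in breakage]
rank_sums = [sum(column) for column in zip(*ranks)]

r = len(breakage)        # number of blocks (products)
c = len(breakage[0])     # number of treatments (shipping methods)

# Friedman statistic and the check on the total of the rank sums.
friedman = 12 / (r * c * (c + 1)) * sum(s ** 2 for s in rank_sums) - 3 * r * (c + 1)
expected_total = r * c * (c + 1) / 2

print(rank_sums, sum(rank_sums) == expected_total, friedman)
```

The statistic is then compared with the Friedman table for the given c and r, or with a chi-squared table with c - 1 degrees of freedom when the table does not cover the case.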


ECO252 QBA2 THIRD EXAM Nov 26-29, 2007 TAKE HOME SECTION Name: ______Student Number: ______Class days and time : ______Please Note: Computer problems 2 and 3 should be turned in with the exam (2). In problem 2, the 2 way ANOVA table should be checked. The three F tests should be done with a 1% significance level and you should note whether there was (i) a significant difference between drivers, (ii) a significant difference between cars and (iii) significant interaction. In problem 3, you should show on your third graph where the regression line is. You should explain whether the coefficients are significant at the 1% level. Check what your text says about normal probability plots and analyze the plot you did. Explain the results of the t and F tests using a 5% significance level. (3)

III Do the following. (22+ points) Note: Look at 252thngs (252thngs) on the syllabus supplement part of the website before you start (and before you take exams). Show your work! State H 0 and H1 where appropriate. You have not done a hypothesis test unless you have stated your hypotheses, run the numbers and stated your conclusion. (Use a 95% confidence level unless another level is specified.) Answers without reasons or accompanying calculations usually are not acceptable. Neatness and clarity of explanation are expected. This must be turned in when you take the in-class exam. Note that from now on neatness means paper neatly trimmed on the left side if it has been torn, multiple pages stapled and paper written on only one side. Show your work!

1) The Lees, in their book on statistics for Finance majors, ask about the relationship of gasoline prices ($y$) in cents per gallon to crude oil prices ($x_1$) in dollars per barrel and present the data for the years 1975 - 1988. I have obtained most of the data for the years 1980 - 2007. It is presented below.

Row  Year  GasPrice  CrudePrice  Yr-1979
1    1980      1.25       26.07        1
2    1981      1.38       35.24        2
3    1982      1.30       31.87        3
4    1983      1.24       26.99        4
5    1984      1.21       28.63        5
6    1985      1.20       26.25        6
7    1986      0.93       14.55        7
8    1987      0.95       17.90        8
9    1988      0.96       14.67        9
10   1989      1.02       17.97       10
11   1990      1.16       22.22       11
12   1991      1.14       19.06       12
13   1992      1.13       18.43       13
14   1993      1.11       16.41       14
15   1994      1.11       15.59       15
16   1995      1.15       17.23       16
17   1996      1.23       20.71       17
18   1997      1.23       19.04       18
19   1998      1.06       12.52       19
20   1999      1.17       17.51       20
21   2000      1.51       28.26       21
22   2001      1.46       22.95       22
23   2002      1.36       24.10       23
24   2003      1.59       28.53       24
25   2004      1.88       36.98       25
26   2005      2.30       50.23       26
27   2006         *           *
28   2007      3.10       90.00


This data set also contains the year with 1979 subtracted from it, $x_2$. You may need to use this later. Ignore it in Problem 1. Note that the numbers for 2006 have not yet been published in my source, Statistical Abstract of the United States, and that the numbers for 2007 are my estimates for third quarter prices. These are unleaded prices, which the Lees did not use. You are supposed to use only the numbers for 1990 through 2006 and one other observation for your data. You will thus have $n = 17$ observations. The other observation is the value for the year $1980 + a$, where $a$ is the second-to-last digit of your student number. If you are unsure of the data that you are using or if you want help with the sums that you need to do the regression, go to 3takehome072a. Show your work; it is legitimate to check your results by running the problem on the computer. (In fact, I will give you 2 points extra credit for checking it and annotating the output for significance tests etc.) But I expect to see hand computations for every part of this problem.

a. Compute the regression equation $\hat{Y} = b_0 + b_1 x$ to predict the price of gasoline on the basis of crude oil prices. (3)
b. Compute $R^2$. (2)
c. Compute $s_e$. (2)
d. Compute $s_{b_1}$ and do a significance test on $b_1$. (2)
e. Compute a confidence interval for $b_0$. (2)
f. You have a crude price for 2007. Using this, predict the gasoline price for 2007 and create a prediction interval for the price of gasoline for that year. Explain why a confidence interval for the price is inappropriate and check to see if my estimated price is in the interval. (3)
g. Do an ANOVA for this regression. (3)
h. Make a graph of the data. Show the trend line and the data points clearly. If you are not willing to do this neatly and accurately, don’t bother. (2) [19]
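The hand computations in Problem 1 can be checked with a short script. What follows is a minimal sketch, not the official solution. It assumes $a = 0$, so that the extra observation is 1980, and it takes the gasoline prices as listed in the table (apparently dollars per gallon). The function names `simple_regression` and `prediction_interval` are hypothetical helpers, and the critical value $t_{.025}(15) = 2.131$ is taken from a standard t table rather than computed.

```python
from math import sqrt

# (crude price x in $/barrel, gasoline price y as listed in the table)
# ASSUMPTION: a = 0, so the personalized extra observation is 1980.
pairs = [
    (26.07, 1.25),                                               # 1980 (assumed)
    (22.22, 1.16), (19.06, 1.14), (18.43, 1.13), (16.41, 1.11),  # 1990-1993
    (15.59, 1.11), (17.23, 1.15), (20.71, 1.23), (19.04, 1.23),  # 1994-1997
    (12.52, 1.06), (17.51, 1.17), (28.26, 1.51), (22.95, 1.46),  # 1998-2001
    (24.10, 1.36), (28.53, 1.59), (36.98, 1.88), (50.23, 2.30),  # 2002-2005
]


def simple_regression(pairs):
    """Least-squares fit of y on x using the usual spare-parts sums."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    ss_x = sum(x * x for x, _ in pairs) - n * x_bar ** 2      # sum of (x - x_bar)^2
    ss_y = sum(y * y for _, y in pairs) - n * y_bar ** 2      # total sum of squares
    s_xy = sum(x * y for x, y in pairs) - n * x_bar * y_bar   # cross-product sum
    b1 = s_xy / ss_x
    b0 = y_bar - b1 * x_bar
    ssr = b1 * s_xy                     # regression sum of squares
    sse = max(ss_y - ssr, 0.0)          # error sum of squares (guard rounding)
    s_e = sqrt(sse / (n - 2))           # standard error of the estimate
    return {
        "n": n, "x_bar": x_bar, "ss_x": ss_x, "b0": b0, "b1": b1,
        "sst": ss_y, "ssr": ssr, "sse": sse, "r2": ssr / ss_y, "s_e": s_e,
        "s_b1": s_e / sqrt(ss_x),
        "s_b0": s_e * sqrt(1 / n + x_bar ** 2 / ss_x),
    }


def prediction_interval(fit, x0, t_crit):
    """Interval for one new y at x0; t_crit comes from a t table."""
    y_hat = fit["b0"] + fit["b1"] * x0
    s_pred = fit["s_e"] * sqrt(
        1 + 1 / fit["n"] + (x0 - fit["x_bar"]) ** 2 / fit["ss_x"]
    )
    return y_hat - t_crit * s_pred, y_hat + t_crit * s_pred


fit = simple_regression(pairs)
print("b0, b1:", fit["b0"], fit["b1"])
print("R squared:", fit["r2"], "s_e:", fit["s_e"])
print("t for b1:", fit["b1"] / fit["s_b1"])
print("F for ANOVA:", fit["ssr"] / (fit["sse"] / (fit["n"] - 2)))
print("95% prediction interval at x = 90:", prediction_interval(fit, 90.00, 2.131))
```

With a different value of $a$, replace the first pair with the row for the year $1980 + a$. The confidence interval for the intercept follows the same pattern as the prediction interval: `fit["b0"]` plus or minus the t value times `fit["s_b0"]`.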

2) Now we can use the date to see if there is a trend line in addition to the effect of crude oil.

a. Do a multiple regression of the price of gasoline against crude prices and the date variable, which has been massaged to make 1980 year 1. This involves a simultaneous equation solution. Attempting to recycle $b_1$ from the previous page won’t work. (7)
b. Compute the regression sum of squares and use it in an ANOVA F test to test the usefulness of this regression. (4)
c. Compute $R^2$ and $R^2$ adjusted for degrees of freedom for both this and the previous problem. Compare the values of $R^2$ adjusted between this and the previous problem. Use an F test to compare $R^2$ here with the $R^2$ from the previous problem. The F test here is one to see if adding a new independent variable improves the regression. This can also be done by modifying the ANOVAs in b. (4)
d. Use your regression to predict the price of gasoline in 2007. Is this closer to the estimated gasoline price? Do a confidence interval and a prediction interval. (3) [37]
e. Again there is extra credit for checking your results on the computer. Use the pull-down menu or try Regress GasPrice on 2 CrudePrice Yr-1979 (2)
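The simultaneous-equation solution for Problem 2 can be sketched in code as well. This is an illustration under stated assumptions, not the official key: it again assumes $a = 0$ (1980 as the extra observation), the helper names `two_variable_regression` and `f_statistics` are hypothetical, and the 2007 prediction uses $x_2 = 2007 - 1979 = 28$. The two normal equations in the centered sums are solved by Cramer's rule.

```python
# (crude price x1, year - 1979 x2, gasoline price y)
# ASSUMPTION: a = 0, so the personalized extra observation is 1980.
rows = [
    (26.07, 1, 1.25),
    (22.22, 11, 1.16), (19.06, 12, 1.14), (18.43, 13, 1.13), (16.41, 14, 1.11),
    (15.59, 15, 1.11), (17.23, 16, 1.15), (20.71, 17, 1.23), (19.04, 18, 1.23),
    (12.52, 19, 1.06), (17.51, 20, 1.17), (28.26, 21, 1.51), (22.95, 22, 1.46),
    (24.10, 23, 1.36), (28.53, 24, 1.59), (36.98, 25, 1.88), (50.23, 26, 2.30),
]


def two_variable_regression(rows):
    """Fit y = b0 + b1*x1 + b2*x2 by solving the two normal equations."""
    n = len(rows)
    m1 = sum(r[0] for r in rows) / n
    m2 = sum(r[1] for r in rows) / n
    my = sum(r[2] for r in rows) / n
    # Centered sums of squares and cross products.
    s11 = sum((r[0] - m1) ** 2 for r in rows)
    s22 = sum((r[1] - m2) ** 2 for r in rows)
    s12 = sum((r[0] - m1) * (r[1] - m2) for r in rows)
    s1y = sum((r[0] - m1) * (r[2] - my) for r in rows)
    s2y = sum((r[1] - m2) * (r[2] - my) for r in rows)
    sst = sum((r[2] - my) ** 2 for r in rows)
    # Normal equations:  s1y = b1*s11 + b2*s12   and   s2y = b1*s12 + b2*s22
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    ssr = b1 * s1y + b2 * s2y
    r2 = ssr / sst
    r2_simple = s1y ** 2 / (s11 * sst)   # R squared using crude price alone
    return {
        "n": n, "b0": b0, "b1": b1, "b2": b2,
        "sst": sst, "ssr": ssr, "sse": sst - ssr,
        "r2": r2, "r2_adj": 1 - (1 - r2) * (n - 1) / (n - 3),
        "r2_simple": r2_simple,
        "r2_adj_simple": 1 - (1 - r2_simple) * (n - 1) / (n - 2),
    }


def f_statistics(fit):
    """Overall ANOVA F (2 and n-3 df) and F for adding x2 (1 and n-3 df)."""
    df_err = fit["n"] - 3
    f_overall = (fit["ssr"] / 2) / (fit["sse"] / df_err)
    f_added = (fit["r2"] - fit["r2_simple"]) / ((1 - fit["r2"]) / df_err)
    return f_overall, f_added


fit2 = two_variable_regression(rows)
print("b0, b1, b2:", fit2["b0"], fit2["b1"], fit2["b2"])
print("R squared, adjusted:", fit2["r2"], fit2["r2_adj"])
print("overall F, F for added variable:", f_statistics(fit2))
# 2007 prediction with crude = 90.00 and x2 = 28 (2007 - 1979)
print("predicted 2007 price:", fit2["b0"] + fit2["b1"] * 90.00 + fit2["b2"] * 28)
```

The F statistics are compared with table values for (2, 14) and (1, 14) degrees of freedom. The confidence and prediction intervals in part d need the inverse of the cross-product matrix and are not covered by this sketch.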

3) According to Russell Langley, three sopranos were discussing their recent performances. Fifi noted that she got 36 curtain calls at La Scala last week, but Adelina put her down with the fact that she got 39. Could one of the singers really say that she had more curtain calls than another or could the differences just be due to chance? Personalize the data below by adding the last digit of your student number to each number in the first row. Use a 10% significance level throughout this question.

Row  Fifi  Adelina  Maria
  1    36       39     21
  2    22       14     32
  3    19       20     28
  4    16       18     22

a) State your hypothesis and use a method to compare means assuming that each column represents a random sample of curtain calls at La Scala. (4)
b) Still assuming that these are random samples, use a method that compares medians instead. (3)
c) Actually, these were not random samples. Though row 1 represents curtain calls at La Scala (Milan), row 2 was in Venice, row 3 in Naples and row 4 in Rome. Will this affect our results? Does this show anything about audiences in the four cities? Use an appropriate method to compare means. (5)
d) Do two different types of confidence intervals between Milan and the least enthusiastic opera house. Explain the difference between the intervals. (2)
e) Assume that we want to compare medians instead. How does the fact that these data were collected at four opera houses affect the results? (3)
f) Do you prefer the methods that compare medians or means? Don’t answer this unless you can demonstrate an informed opinion. (1)
g) (Extra credit) Do a Levene test on these data and explain what it tests and shows. (3)
h) (Extra credit) Check your work on the computer. This is pretty easy to do. Use the same format as in Computer Problem 2, but instead of car and driver numbers use the singers’ and cities’ names. You can use the stat and ANOVA pull-down menus for One-way ANOVA, two-way ANOVA and comparison of variances of the columns. You can use the stat and the non-parametrics pull-down menu for Friedman and Kruskal-Wallis. You also probably ought to test columns for Normality. Use the Statistics pull-down menu and basic statistics to find the normality tests. The Kolmogorov-Smirnov option is actually Lilliefors. The ANOVA menu can check for equality of variances. In light of these tests was ANOVA appropriate? You can get descriptions of unfamiliar tests by using the Help menu and the alphabetic command list or the Stat guide. (Up to 7) [58] You should note conclusions on the printout: tell what was tested and what your conclusions are using a 10% significance level.
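Three of the test statistics in Problem 3 are easy to check outside the statistics package. The sketch below is one possible cross-check, not the official solution. It assumes the last digit of the student number is 0, so the first row is used as printed, and the function names are hypothetical. It computes the one-way ANOVA F for the three singers, the Kruskal-Wallis H without a tie correction, and the Friedman statistic with cities as blocks.

```python
# Curtain calls by singer; positions are Milan, Venice, Naples, Rome.
# ASSUMPTION: last digit of the student number is 0 (row 1 unchanged).
calls = [
    [36, 22, 19, 16],  # Fifi
    [39, 14, 20, 18],  # Adelina
    [21, 32, 28, 22],  # Maria
]


def average_ranks(values):
    """Ranks with 1 = smallest; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # positions i..j are zero-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def one_way_anova_f(columns):
    """F ratio of between-column to within-column mean squares."""
    all_values = [v for col in columns for v in col]
    n = len(all_values)
    grand_mean = sum(all_values) / n
    ss_between = sum(
        len(col) * (sum(col) / len(col) - grand_mean) ** 2 for col in columns
    )
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_within = ss_total - ss_between
    return (ss_between / (len(columns) - 1)) / (ss_within / (n - len(columns)))


def kruskal_wallis_h(columns):
    """Kruskal-Wallis H from pooled ranks (no tie correction applied)."""
    all_values = [v for col in columns for v in col]
    ranks = average_ranks(all_values)
    n = len(all_values)
    total, start = 0.0, 0
    for col in columns:
        rank_sum = sum(ranks[start:start + len(col)])
        total += rank_sum ** 2 / len(col)
        start += len(col)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


def friedman_chi2(columns):
    """Friedman statistic: rank the columns within each row (block)."""
    blocks = list(zip(*columns))
    r, c = len(blocks), len(columns)
    rank_sums = [0.0] * c
    for block in blocks:
        for j, rank in enumerate(average_ranks(list(block))):
            rank_sums[j] += rank
    return 12 / (r * c * (c + 1)) * sum(s ** 2 for s in rank_sums) - 3 * r * (c + 1)


print("one-way ANOVA F:", one_way_anova_f(calls))
print("Kruskal-Wallis H:", kruskal_wallis_h(calls))
print("Friedman chi-square:", friedman_chi2(calls))
```

At the 10% level the F ratio is compared with the F table for 2 and 9 degrees of freedom. For samples this small, the Kruskal-Wallis and Friedman statistics are best compared with their own small-sample tables; the chi-square value with 2 degrees of freedom (4.605) is only an approximation here.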
