FINAL EXAM Hour of Class Registered ______DEC 11, 2003

252y0343 1/12/04 ECO252 QBA2 Name KEY FINAL EXAM Hour of Class Registered ______DEC 11, 2003

I. (25+ points) Do all the following. Note that answers without reasons receive no credit. Most answers require a statistical test, that is, stating or implying a hypothesis and showing why it is true or false by citing a table value or a p-value.

The fourth computer problem involved the regression of the Y variable below against some, but not all, of the X values.

Column  Variables in Data Set
C2      X2  Type: 1 = Private, 0 = Public
C3      X1  First Quartile SAT
C4      X5  3rd Quartile SAT
C5      X4  Room and Board Cost
C6      Y   Annual Total Cost
C7      X6  Average Indebtedness at Graduation
C8      X3  Interaction = X1 * X2

You were directed to hand in the computer output (4 points) and your answer to problem 14.37 or the equivalent problem in the 8th edition (up to 7 points). I ran the same problem you did, but went on to add X4 and X5 to the input. The output appears on pages 1-7. My first two regressions were stepwise regressions. The second stepwise regression is set up to force the dummy variable designating ‘type of university’ into the equation. This means that our first equation in regression 2 is essentially $\hat{Y} = 12478 + 11646 X_2$.

a) According to the first regression in regression 2, what are the mean annual total costs for public and private universities, and how does the printout show us that they are significantly different? (2)

b) Regressions 3 and 4 are the regressions you supposedly did. According to regression 4, for a public university the constant in the regression equation is 1013 and the slope relative to the first quartile SAT is 11.339. For a private university the equation relating annual total costs to the first quartile SAT effectively has both a different intercept and a different slope; what is the equation? Are the intercepts and slopes for public and private universities in fact significantly different? What tells us this? (3) (Extra credit: at what SAT level do public and private universities have the same cost? (2))

c) Regression 6 should be the best of all the regressions, because it has the most independent variables and the highest R-squared, but it isn’t. (i) Look at the coefficients of the independent variables and ignore their significance. One of those coefficients is incredibly unreasonable; which one is it? (1) (ii) Which coefficients are significant at the 1% level, and why? (2) What about the 10% level? (1) Compare the adjusted R-squares with those of the other regressions. What do they tell us? (1) Look at the VIFs. What do they imply? (2)

d) Do an F test to tell whether adding X3, X4 and X5 as a package to equation 3, which has only X1 and X2, was useful. What is your conclusion? (4)

e) I didn’t follow directions when I did a prediction interval for equation 3, so it should disagree with yours. I added some guesses as to (median?) values for X3, X4 and X5. What does the printout say I used? What would you expect to happen to the size of the prediction interval if our addition of new variables gives us a better estimate of Y? Did it happen? Cite numbers. (3)

f) Use the method suggested in the text, using the standard error to compute a prediction interval for the same values of the independent variables and equation 3. How accurate is it? (3)

32

————— 12/5/2003 7:15:10 PM ————————————————————

Welcome to Minitab, press F1 for help.
MTB > Retrieve "C:\Documents and Settings\RBOVE.WCUPANET\My Documents\Drive D\MINITAB\Colleges2002.MTW".

Retrieving worksheet from file: C:\Documents and Settings\RBOVE.WCUPANET\My Documents\Drive D\MINITAB\Colleges2002.MTW # Worksheet was saved on Fri Dec 05 2003

Results for: Colleges2002.MTW

MTB > Stepwise c6 c3 c2 c8 c5 c4;
SUBC> AEnter 0.15;
SUBC> ARemove 0.15;
SUBC> Constant.

1) Stepwise Regression: Annual Total versus First quarti, Type of Scho, ...

Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15

Response is Annual T on 5 predictors, with N = 80

Step             1       2
Constant     12198   -1021

inter        10.42    8.35
T-Value      17.40   11.97
P-Value      0.000   0.000

First qu              13.3
T-Value               4.62
P-Value              0.000

S             3058    2724
R-Sq         79.51   83.95
R-Sq(adj)    79.25   83.54
C-p           20.2     1.3
More? (Yes, No, Subcommand, or Help)
SUBC> yes

No variables entered or removed

More? (Yes, No, Subcommand, or Help)
SUBC> no
MTB > Stepwise c6 c3 c2 c8 c5 c4;
SUBC> Force c2;
SUBC> AEnter 0.15;
SUBC> ARemove 0.15;
SUBC> Constant.

2) Stepwise Regression: Annual Total versus First quarti, Type of Scho, ...

Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15

Response is Annual T on 5 predictors, with N = 80

Step             1       2       3
Constant     12478   -7264    1013

Type of      11646    8732   -3016
T-Value      13.97   11.57   -0.48
P-Value      0.000   0.000   0.630

First qu              19.5    11.3
T-Value               7.37    2.25
P-Value              0.000   0.027

inter                         11.2
T-Value                       1.90
P-Value                      0.061

S             3610    2783    2737
R-Sq         71.44   83.24   84.00
R-Sq(adj)    71.07   82.81   83.37
C-p           58.1     4.7     3.1
More? (Yes, No, Subcommand, or Help)
SUBC> yes

No variables entered or removed

More? (Yes, No, Subcommand, or Help)
SUBC> no
MTB > Name c18 = 'RESI1'
MTB > Regress c6 2 c3 c2;
SUBC> Residuals 'RESI1';
SUBC> GHistogram;
SUBC> GNormalplot;
SUBC> GFits;
SUBC> RType 1;
SUBC> Constant;
SUBC> VIF;
SUBC> Predict c9 c10;
SUBC> Brief 2.

3) Regression Analysis: Annual Total versus First quarti, Type of Scho

The regression equation is Annual Total Cost = - 7264 + 19.5 First quartile SAT + 8732 Type of School

Predictor       Coef   SE Coef       T       P    VIF
Constant       -7264      2728   -2.66   0.009
First qu      19.524     2.651    7.37   0.000    1.4
Type of       8732.4     754.7   11.57   0.000    1.4

S = 2783 R-Sq = 83.2% R-Sq(adj) = 82.8%

Analysis of Variance

Source           DF          SS          MS        F       P
Regression        2  2963313624  1481656812   191.26   0.000
Residual Error   77   596492779     7746659
Total            79  3559806404

Source     DF      Seq SS
First qu    1  1926306635
Type of     1  1037006989

Unusual Observations
Obs  First qu  Annual T    Fit  SE Fit  Residual  St Resid
 27      1040     21484  13041     514      8443     3.09R
 56      1010     15722  21188     560     -5466    -2.00R
 61      1320     17526  27240     578     -9714    -3.57R

R denotes an observation with a large standardized residual

Predicted Values for New Observations

New Obs    Fit  SE Fit          95.0% CI              95.0% PI
1        20993     579  (  19839,  22147)  (  15332,  26654)

Values of Predictors for New Observations

New Obs  First qu  Type of
1            1000     1.00
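Part f) of the question asks how close a standard-error shortcut comes to the printed prediction interval. The sketch below is one way to check it, assuming the shortcut meant by the text is "fit plus or minus t times S" (that reading is an assumption, as is the table value 1.991 used for the two-tailed 5% t with 77 degrees of freedom). All other numbers are read from the regression 3 printout.

```python
import math

# Values read from the regression 3 printout.
fit = 20993.0     # Fit for First quartile SAT = 1000, Type = 1
se_fit = 579.0    # SE Fit
s_e = 2783.0      # S, the standard error of the regression

# Assumption: two-tailed 5% t value for 77 degrees of freedom, from a t table.
T_CRIT = 1.991

def interval(center, half_width):
    """Return (lower, upper) for center +/- half_width."""
    return (center - half_width, center + half_width)

# Shortcut: ignore the uncertainty in the fitted value itself.
rough_pi = interval(fit, T_CRIT * s_e)

# Fuller version: add the variance of the fit to the error variance.
full_pi = interval(fit, T_CRIT * math.sqrt(s_e ** 2 + se_fit ** 2))

print("shortcut interval:", [round(v) for v in rough_pi])
print("fuller interval:  ", [round(v) for v in full_pi])
```

The fuller version should land within a few dollars of the printed 95% PI, while the shortcut is somewhat narrower because it leaves out SE Fit.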

Residual Histogram for Annual T

Normplot of Residuals for Annual T

Residuals vs Fits for Annual T

MTB > %Resplots c18 c2; SUBC> Title "Residuals vs Type". Executing from file: W:\wminitab13\MACROS\Resplots.MAC Macro is running ... please wait

Residual Plots: RESI1 vs Type of Scho


MTB > %Resplots c18 c3; SUBC> Title "Residuals vs Type". Executing from file: W:\wminitab13\MACROS\Resplots.MAC Macro is running ... please wait

Residual Plots: RESI1 vs First quarti

MTB > Name c19 = 'RESI2'
MTB > Regress c6 3 c3 c2 c8;
SUBC> Residuals 'RESI2';
SUBC> GHistogram;
SUBC> GNormalplot;
SUBC> GFits;
SUBC> RType 1;
SUBC> Constant;
SUBC> VIF;
SUBC> Predict c9 c10 c11;
SUBC> Brief 2.

4) Regression Analysis: Annual Total versus First quarti, Type of Scho, ...

The regression equation is Annual Total Cost = 1013 + 11.3 First quartile SAT - 3016 Type of School + 11.2 inter

Predictor       Coef   SE Coef       T       P     VIF
Constant        1013      5120    0.20   0.844
First qu      11.339     5.039    2.25   0.027     5.2
Type of        -3016      6234   -0.48   0.630    97.2
inter         11.177     5.889    1.90   0.061   120.6

S = 2737 R-Sq = 84.0% R-Sq(adj) = 83.4%

Source DF Seq SS First qu 1 1926306635 Type of 1 1037006989 inter 1 26995957

Unusual Observations Obs First qu Annual T Fit SE Fit Residual St Resid 3 800 9476 10084 1176 -608 -0.25 X 9 1250 13986 15186 1303 -1200 -0.50 X 27 1040 21484 12805 520 8679 3.23R 61 1320 17526 27718 622 -10192 -3.82R

R denotes an observation with a large standardized residual X denotes an observation whose X value gives it large influence.

New Obs First qu Type of inter 1 1000 1.00 1000
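For part b) of the question, the private-university line comes from adding the dummy and interaction coefficients to the public-university intercept and slope. A minimal sketch using the regression 4 coefficients (the function and variable names are illustrative only):

```python
# Coefficients from the regression 4 printout.
b_const, b_sat, b_type, b_inter = 1013.0, 11.339, -3016.0, 11.177

def cost_line(school_type):
    """Return (intercept, slope) of cost on first quartile SAT.

    school_type is 0 for public and 1 for private.
    """
    intercept = b_const + b_type * school_type
    slope = b_sat + b_inter * school_type
    return intercept, slope

public_line = cost_line(0)
private_line = cost_line(1)

# The two lines cross where the type effect cancels:
# b_type + b_inter * SAT = 0.
break_even_sat = -b_type / b_inter

print("public: ", public_line)
print("private:", private_line)
print("equal cost at SAT =", round(break_even_sat, 1))
```

The crossing point falls far below any SAT score in the data, so within the observed range the fitted private cost is always the higher one.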

MTB > Name c20 = 'RESI3'
MTB > Regress c6 4 c3 c2 c8 c5;
SUBC> Residuals 'RESI3';
SUBC> Constant;
SUBC> VIF;
SUBC> Predict c9 c10 c11 c12;
SUBC> Brief 2.

The regression equation is Annual Total Cost = - 13 + 11.4 First quartile SAT - 3053 Type of School + 10.9 inter + 0.165 Room and Board

Predictor       Coef   SE Coef       T       P     VIF
Constant         -13      5483   -0.00   0.998
First qu      11.382     5.064    2.25   0.028     5.2
Type of        -3053      6263   -0.49   0.627    97.3
inter         10.928     5.934    1.84   0.069   121.3
Room and      0.1655    0.3062    0.54   0.591     1.9

S = 2750 R-Sq = 84.1% R-Sq(adj) = 83.2%

Source DF Seq SS First qu 1 1926306635 Type of 1 1037006989 inter 1 26995957 Room and 1 2208452

Unusual Observations Obs First qu Annual T Fit SE Fit Residual St Resid 9 1250 13986 15174 1309 -1188 -0.49 X 27 1040 21484 12880 541 8604 3.19R 61 1320 17526 27621 650 -10095 -3.78R

New Obs First qu Type of inter Room and 1 1000 1.00 1000 5000

MTB > Name c21 = 'RESI4'
MTB > Regress c6 5 c3 c2 c8 c5 c4;
SUBC> Residuals 'RESI4';
SUBC> Constant;
SUBC> VIF;
SUBC> Predict c9 c10 c11 c12 c13;
SUBC> Brief 2.

The regression equation is Annual Total Cost = 5873 + 26.2 First quartile SAT - 4605 Type of School + 12.2 inter + 0.150 Room and Board - 17.0 Third quartile SAT

Predictor       Coef   SE Coef       T       P     VIF
Constant        5873      8515    0.69   0.493
First qu       26.23     17.18    1.53   0.131    59.2
Type of        -4605      6502   -0.71   0.481   104.5
inter         12.162     6.096    2.00   0.050   127.7
Room and      0.1503    0.3070    0.49   0.626     1.9
Third qu      -17.01     18.81   -0.90   0.369    58.0

S = 2754 R-Sq = 84.2% R-Sq(adj) = 83.2%

Source DF Seq SS First qu 1 1926306635 Type of 1 1037006989 inter 1 26995957 Room and 1 2208452 Third qu 1 6200863

Unusual Observations Obs First qu Annual T Fit SE Fit Residual St Resid 3 800 9476 9606 1323 -130 -0.05 X 9 1250 13986 15381 1331 -1395 -0.58 X 27 1040 21484 13192 642 8292 3.10R 61 1320 17526 27217 789 -9691 -3.67R

New Obs First qu Type of inter Room and Third qu 1 1000 1.00 1000 5000 1200
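Part d) of the question asks whether X3, X4 and X5 are useful as a package. One way to set up that F test is from the sequential sums of squares in the five-predictor printout, since the last three Seq SS values together are the extra regression sum of squares over equation 3. The sketch below does the arithmetic; the critical value of about 2.73 for 3 and 74 degrees of freedom is an assumed table value, not something printed in the output.

```python
# Sequential sums of squares from the five-predictor printout.
seq_ss = {
    "First qu": 1926306635,
    "Type of": 1037006989,
    "inter": 26995957,
    "Room and": 2208452,
    "Third qu": 6200863,
}
sst = 3559806404   # Total SS, from the regression 3 ANOVA table
n = 80             # number of observations

added = ["inter", "Room and", "Third qu"]   # X3, X4, X5

# Error sum of squares and degrees of freedom for the full model.
sse_full = sst - sum(seq_ss.values())
df_error = n - len(seq_ss) - 1

# Extra regression sum of squares from the three added variables.
extra_ss = sum(seq_ss[name] for name in added)

f_stat = (extra_ss / len(added)) / (sse_full / df_error)

F_CRIT = 2.73  # assumption: approximate 5% table value for (3, 74) df
print("F =", round(f_stat, 3), "reject H0:", f_stat > F_CRIT)
```

With these inputs the ratio comes out well under the assumed critical value, which is the same story the adjusted R-squared figures tell.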

MTB > Name c22 = 'RESI5'
MTB > Regress c6 2 c3 c5 ;
SUBC> Residuals 'RESI5';
SUBC> Constant;
SUBC> VIF;
SUBC> Predict c9 c12 ;
SUBC> Brief 2.

7) Regression Analysis: Annual Total versus First quarti, Room and Boa

The regression equation is Annual Total Cost = - 24258 + 27.9 First quartile SAT + 1.84 Room and Board

Predictor       Coef   SE Coef       T       P    VIF
Constant      -24258      3686   -6.58   0.000
First qu      27.927     3.532    7.91   0.000    1.2
Room and      1.8439    0.3534    5.22   0.000    1.2

S = 3959 R-Sq = 66.1% R-Sq(adj) = 65.2%

Source     DF      Seq SS
First qu    1  1926306635
Room and    1   426662346

Unusual Observations Obs First qu Annual T Fit SE Fit Residual St Resid 14 920 7210 16752 1006 -9542 -2.49R 16 1120 9451 18272 593 -8821 -2.25R 41 1060 25865 14472 849 11393 2.95R 53 900 17886 18758 1441 -872 -0.24 X 61 1320 17526 26398 845 -8872 -2.29R

New Obs First qu Room and 1 1000 5000

Note: The assignment of the fourth computer problem was given as follows.

This problem is problem 14.37 in the 9th edition of the textbook. The data are on the next two pages. As far as I can see, the problem is identical to Problem 15.10 in the 8th edition, but you should use the 9th edition data, which can be downloaded from the website at http://courses.wcupa.edu/rbove/eco252/Colleges2002.MTP or the disk for the 9th edition. To download from the website, enter Minitab, use the file pull-down menu, pick ‘open worksheet’ and copy the URL into ‘File name.’ If you can, save the worksheet on the computer on which you are working. Read through the document 252soln K1 and use the analysis there to help you with this problem. In particular, Exercise 14.35 is almost identical to this problem and my description is very similar. Once you get the data loaded, you need a column for the interaction variable and two columns for the input to the prediction interval. The problem says “Develop a model to predict the annual total cost based on SAT score and whether the school is public or private.”

The assignment was amplified as follows.

Dec 5 - I finally got around to running the last computer problem. I have suggested that you carefully and neatly write up a solution to problem 14.37, using my write-up to problems in that section as a model, and turn it in with your computer output and the exam. When I ran it I used the following allocation of columns:

Column  Variables in Data Set
C2      X2  Type 1 = Private
C3      X1  First Quartile SAT
C4      X5  3rd Quartile SAT
C5      X4  Room and Board Cost
C6      Y   Annual Total Cost
C7      X6  Average Indebtedness at Graduation
C8      X3  Interaction = X1 * X2

I used the columns beyond column 8 for the data for the prediction interval; for example, I put 1000 in column 9. Your inputs for the interval should be in the same order that you name the predictors in the pull-down menu regression instruction.
Of course I didn’t actually use X6, which is part of a very different problem, though I did experiment with some of the other variables.

1) There was no reason why the majority of you concluded that Y, the dependent variable, was either X2 (Type: 1 = Private) or X1 (First Quartile SAT). Does it make sense to explain First Quartile SAT by how much the school costs?
2) To be able to explain the results or even do the problem, you have to have read some of the posted problem solutions. I don’t see any evidence that many of you had read them.

Unfortunately, I do not have time to write the solution I wanted to see. The following comes from the text solution manual – but I hardly expected anything this complete.

14.37 (a) $\hat{Y} = -7264 + 19.5X_1 + 8732X_2$, where $X_1$ = first quartile SAT score and $X_2$ = type of institution (public = 0, private = 1). (b) Holding constant the effect of type of institution, for each point increase on the first quartile SAT, the total cost is estimated to increase on average by $19.53. For a given first quartile SAT score, a private college or university is estimated to have an average total cost of $8732.42 over a public institution. (c) $\hat{Y} = 12260.38$ (dollars) (d)

[Figure: First quartile SAT Residual Plot. Residuals (vertical axis, -12000 to 10000) against First quartile SAT (horizontal axis, 0 to 1600).]

Based on a residual analysis, the model shows departure from the homoscedasticity assumption caused by the first quartile SAT score. From the normal probability plot, the residuals appear to be normally distributed with the exception of the single outlier in each of the two tails.

(e) $F = 191.26$ with a p-value of 0.000 (regression 3 above). Reject $H_0$. There is evidence of a relationship between total cost and the two independent variables.
(f) For $X_1$: $t = 7.37$ with a p-value of 0.000. Reject $H_0$. First quartile SAT score makes a significant contribution and should be included in the model. For $X_2$: $t = 11.57$ with a p-value of 0.000. Reject $H_0$. Type of institution makes a significant contribution and should be included in the model. Based on these results, the regression model with the two independent variables should be used.
(g)

(h) $R^2 = 0.8324$. 83.24% of the variation in total cost can be explained by variation in first quartile SAT score and variation in type of institution.

[Figure: Normal Probability Plot. Residuals (vertical axis, -12000 to 10000) against Z Value (horizontal axis, -2.5 to 2.5).]

(i)
(j) $r^2_{Y1.2} = 0.4133$. Holding constant the effect of type of institution, 41.33% of the variation in total cost can be explained by variation in first quartile SAT score. $r^2_{Y2.1} = 0.6348$. Holding constant the effect of first quartile SAT score, 63.48% of the variation in total cost can be explained by variation in type of institution.
(k) The slope of total cost with first quartile SAT score is the same regardless of whether the institution is public or private.
(l)

For $X_1X_2$: the p-value is 0.0615. Do not reject $H_0$. There is no evidence that the interaction term makes a contribution to the model. (m) The two-variable model in (a) should be used.

II. Do at least 4 of the following 6 Problems (at least 13 each) (or do sections adding to at least 50 points - anything extra you do helps, and grades wrap around). Show your work! State $H_0$ and $H_1$ where applicable.

Use a significance level of 5% unless noted otherwise. Do not answer questions without citing appropriate statistical tests – That is, explain your hypotheses and what values from what table were used to test them.

1. A marketing analyst collects data on the screen size and price of the two models produced by a competitor. Here ‘price’ is the price in dollars, ‘size’ is screen size in inches, ‘model’ is 1 for the deluxe model (zero for the regular model) and the last column is the one in which you may rank size.

Row     price  size  model  size^2  model^2  price^2  size*price  model*price  size*model  rank
1      371.69    13      0     169        0   138153        4832         0.00           0     1
2      403.61    19      0     361        0   162901        7669         0.00           0     3
3      484.41    21      0     441        0   234653       10173         0.00           0
4      492.89    17      1     289        1   242941        8379       492.89          17
5      606.25    25      0     625        0   367539       15156         0.00           0
6      634.41    21      1     441        1   402476       13323       634.41          21
7      651.00   210      0   44100        0   423801      136710         0.00           0
8      806.25    25      1     625        1   650039       20156       806.25          25
9     1131.00   290      0   84100        0  1279161      327990         0.00           0   9.0
10    1739.00   370      1  136900        1  3024121      643430      1739.00         370  10.0
Sum   7320.51  1011      4  268051        4  6925785     1187817      3672.55         433

a. Fill in the column. (2) Most people did this and just about everyone got credit.
b. Compute the simple regression of price against size. (6)
c. Compute R squared and R squared adjusted for degrees of freedom. (3)
d. Compute the standard error. (3)
e. Compute and make it into a confidence interval for . (3)
f. Do a prediction interval for the price of a model with a 19 inch screen. (4)
21

Solution: a) Fill in the column. (2) Material is in red above.
b) Compute the simple regression of price against size. From above, with x = size and y = price, $\sum x = 1011$, $\sum y = 7320.51$, $\sum x^2 = 268051$, $\sum xy = 1187817$ and $\sum y^2 = 6925785$. (In spite of the fact that most column computations were done for you, many of you wasted time and energy doing them over again. Then there were those who decided that instead of the that was computed for you decided that . Anyone who did this should be sentenced to repeat ECO251.)

Spare Parts Computation: Note that the starred quantities are sums of squares and must be positive.

becomes .
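The spare-parts route to the simple regression can be traced with a short script. This is only a sketch of the standard computational formulas applied to the column sums in the table above (with x = size and y = price); the variable names are illustrative and the t value 2.306 for 8 degrees of freedom is an assumed table value.

```python
import math

n = 10
sum_x, sum_y = 1011.0, 7320.51
sum_x2, sum_xy, sum_y2 = 268051.0, 1187817.0, 6925785.0

x_bar, y_bar = sum_x / n, sum_y / n

# Spare parts: corrected sums of squares and cross products.
ss_x = sum_x2 - n * x_bar ** 2      # must be positive
s_xy = sum_xy - n * x_bar * y_bar
ss_y = sum_y2 - n * y_bar ** 2      # must be positive (this is SST)

b1 = s_xy / ss_x                    # slope
b0 = y_bar - b1 * x_bar             # intercept

ssr = b1 * s_xy
r2 = ssr / ss_y
k = 1                               # number of independent variables
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)

s_e = math.sqrt((ss_y - ssr) / (n - 2))   # standard error
s_b1 = s_e / math.sqrt(ss_x)              # standard error of the slope

T_CRIT = 2.306  # assumption: two-tailed 5% t value for 8 df, from a t table
slope_ci = (b1 - T_CRIT * s_b1, b1 + T_CRIT * s_b1)

# Prediction interval for a 19 inch screen.
x0 = 19.0
y0_hat = b0 + b1 * x0
s_pred = s_e * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / ss_x)
pred_interval = (y0_hat - T_CRIT * s_pred, y0_hat + T_CRIT * s_pred)

print(f"price = {b0:.2f} + {b1:.4f} size, R2 = {r2:.4f}, adj R2 = {r2_adj:.4f}")
print(f"s_e = {s_e:.2f}, slope CI = ({slope_ci[0]:.3f}, {slope_ci[1]:.3f})")
print(f"prediction at 19 = {y0_hat:.2f}, interval = "
      f"({pred_interval[0]:.2f}, {pred_interval[1]:.2f})")
```

The prediction interval is wide relative to the predicted price because the standard error is large compared with prices near the low end of the data.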

c) Compute R squared and R squared adjusted for degrees of freedom. (3) We already know that $SSR = 1208692$ and $SST = 1566799$, so $R^2 = SSR/SST = .7714$. ($R^2$ must be between zero and one!) We could also try so that .

$k$ is the number of independent variables. R squared adjusted for degrees of freedom must be below R squared. d) Compute the standard error (3) and where e) Compute and make it into a confidence interval for . (3) Using the formula from the outline,

f) Do a prediction interval for the price of a model with a 19 inch screen. (4) 21 From the outline, the Prediction Interval is , where . In this formula, for some specific , . Here and , so , and . Then and , so that, if the prediction interval is . This represents a confidence interval for a particular value that will take when and is proportionally rather gigantic because we have picked a point fairly far from the mean of the data that was actually experienced.

2. A marketing analyst collects data on the screen size and price of the two models produced by a competitor. Here ‘price’ is the price in dollars, ‘size’ is screen size in inches, ‘model’ is 1 for the deluxe model (zero for the regular model) and the last column is the one in which you will rank size.

Row     price  size  model  size^2  model^2  price^2  size*price  model*price  size*model  rank
1      371.69    13      0     169        0   138153        4832         0.00           0     1
2      403.61    19      0     361        0   162901        7669         0.00           0     3
3      484.41    21      0     441        0   234653       10173         0.00           0
4      492.89    17      1     289        1   242941        8379       492.89          17
5      606.25    25      0     625        0   367539       15156         0.00           0
6      634.41    21      1     441        1   402476       13323       634.41          21
7      651.00   210      0   44100        0   423801      136710         0.00           0
8      806.25    25      1     625        1   650039       20156       806.25          25
9     1131.00   290      0   84100        0  1279161      327990         0.00           0   9.0
10    1739.00   370      1  136900        1  3024121      643430      1739.00         370  10.0
Sum   7320.51  1011      4  268051        4  6925785     1187817      3672.55         433

a. Do a multiple regression of price against size and model. (10)
b. Compute R-squared and R-squared adjusted for degrees of freedom for this regression and compare them with the values for the previous problem. (4)
c. Using either R-squares or SST, SSR and SSE, do F tests (ANOVA). First check the usefulness of the simple regression and then the value of ‘model’ as an improvement to the regression. (6)
d. Predict the price of a deluxe model with a 19 inch screen. How much change is there from your last prediction? (2)

Solution: a) Do a multiple regression of price against size and model. (10) We have the following spare parts from the last problem. Spare Parts Computation:

And from above 10, 4, 4*, and so that . About half of you decided that . Perhaps you thought that I was crazy to do all these computations for you. Note that the starred quantities are sums of squares and must be positive.

We need

* and * indicates quantities that must be positive.

Then we substitute these numbers into the Simplified Normal Equations:


or and , which become

We solve the Normal Equations as two equations in two unknowns for . These are a fairly tough pair of equations to solve until we notice that, if we multiply 2.4 by 11.91667 we get 28.6 If we subtract these, we get . This means that Now remember that and this means or . So Finally we get by solving .

Thus our equation is .
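The two simplified normal equations can also be solved directly as a 2 by 2 system, which avoids the elimination step. The sketch below uses only the column sums from the data table (x1 = size, x2 = model, y = price) and Cramer's rule; it is an illustration of the method, and because it carries more decimal places than a hand solution, its results may differ slightly from hand-rounded figures.

```python
n = 10
sum_y, sum_x1, sum_x2 = 7320.51, 1011.0, 4.0
sum_x1sq, sum_x2sq, sum_ysq = 268051.0, 4.0, 6925785.0
sum_x1y, sum_x2y, sum_x1x2 = 1187817.0, 3672.55, 433.0

y_bar, x1_bar, x2_bar = sum_y / n, sum_x1 / n, sum_x2 / n

# Spare parts (corrected sums of squares and cross products).
ss_x1 = sum_x1sq - n * x1_bar ** 2
ss_x2 = sum_x2sq - n * x2_bar ** 2
ss_y = sum_ysq - n * y_bar ** 2
s_x1y = sum_x1y - n * x1_bar * y_bar
s_x2y = sum_x2y - n * x2_bar * y_bar
s_x1x2 = sum_x1x2 - n * x1_bar * x2_bar

# Simplified normal equations:
#   s_x1y = b1 * ss_x1  + b2 * s_x1x2
#   s_x2y = b1 * s_x1x2 + b2 * ss_x2
det = ss_x1 * ss_x2 - s_x1x2 ** 2
b1 = (s_x1y * ss_x2 - s_x2y * s_x1x2) / det
b2 = (s_x2y * ss_x1 - s_x1y * s_x1x2) / det
b0 = y_bar - b1 * x1_bar - b2 * x2_bar

r2 = (b1 * s_x1y + b2 * s_x2y) / ss_y
k = 2
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Predicted price of a deluxe model (x2 = 1) with a 19 inch screen.
deluxe_19 = b0 + b1 * 19 + b2 * 1

print(f"price = {b0:.2f} + {b1:.4f} size + {b2:.2f} model")
print(f"R2 = {r2:.4f}, adj R2 = {r2_adj:.4f}, deluxe 19 inch = {deluxe_19:.2f}")
```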

b. Compute R-squared and R-squared adjusted for degrees of freedom for this regression and compare them with the values for the previous problem. (4) On the previous pages and ( must be between zero and one! R squared adjusted for degrees of freedom must be below R squared.) (The way I did it) and so ,

* so * Note that the starred quantities are sums of squares and must be positive. . If we use , which is adjusted for degrees of freedom Both of these have risen, so it looks like we did well by adding the new independent variable.

c. Using either R – squares or SST, SSR and SSE do F tests (ANOVA). First check the usefulness of the simple regression and then the value of ‘model’ as an improvement to the regression (6)

For this regression, the ANOVA reads:

Source           DF       SS      MS      F
Regression        2  1395874  697937  28.58
Residual Error    7   170925   24418
Total             9  1566799

Since our computed F is larger than the table F, we reject the hypothesis that X and Y are unrelated.

For the previous regression, the ANOVA reads:

Source           DF       SS       MS      F
Regression        1  1208692  1208692  27.00
Residual Error    8   358107    44763
Total             9  1566799

Since our computed F is larger than the table F, we reject the hypothesis that X and Y are unrelated. The change in the regression sum of squares is 1395874 - 1208692 = 187182, so we have

Source           DF       SS      MS      F
Size              1  1208692
Model             1   187182  187182  7.666
Residual Error    7   170925   24418
Total             9  1566799

Since our computed F is larger than the table F, we reject the hypothesis that Model does not contribute to the explanation of Y.

Recall that now $R^2 = .891$ and before $R^2 = .771$. We can rewrite the analysis with R-squared as follows.

Source           DF  ‘SS’                  MS      F
Size              1  .771
Model             1  .891 - .771 = .120    .120    7.692
Residual Error    7  1 - .891 = .109       .0156
Total             9  1.000

This is identical with the previous ANOVA except for rounding error.

d. Predict the price of a deluxe model with a 19 inch screen. How much change is there from your last prediction? (2) 22 Our equation is . So we have . Our previous prediction was 510.41, more than 10% less, which indicates that the new variable is making a difference.


3. A marketing analyst collects data on the screen size and price of the two models produced by a competitor. Here ‘price’ is the price in dollars, ‘size’ is screen size in inches, ‘model’ is 1 for the deluxe model (zero for the regular model) and the last column is the one in which you will rank size.

Row     price  size  model  size^2  model^2  price^2  size*price  model*price  size*model  rank
1      371.69    13      0     169        0   138153        4832         0.00           0     1
2      403.61    19      0     361        0   162901        7669         0.00           0     3
3      484.41    21      0     441        0   234653       10173         0.00           0   4.5
4      492.89    17      1     289        1   242941        8379       492.89          17     2
5      606.25    25      0     625        0   367539       15156         0.00           0   6.5
6      634.41    21      1     441        1   402476       13323       634.41          21   4.5
7      651.00   210      0   44100        0   423801      136710         0.00           0   8.0
8      806.25    25      1     625        1   650039       20156       806.25          25   6.5
9     1131.00   290      0   84100        0  1279161      327990         0.00           0   9.0
10    1739.00   370      1  136900        1  3024121      643430      1739.00         370  10.0
Sum   7320.51  1011      4  268051        4  6925785     1187817      3672.55         433

a. Compute the correlation between price and size and check to see if it is significant, using the spare parts from problem 1 if you have them. (5)
b. Use the same correlation to test the hypothesis that the correlation is .85. (4)
c. Do ranks for the values of ‘size’ in the column, compute a rank correlation between price and size and test it for significance using the rank correlation table if possible. (5)
14

a) , but the easiest way to compute it is to remember that $R^2 = .7714$ and that the slope was positive, so that we can take the positive square root and get $r = \sqrt{.7714} = .8783$. The outline says that if we want to test $H_0: \rho = 0$ against $H_1: \rho \ne 0$ and x and y are normally distributed, we use $t = \frac{r}{\sqrt{(1-r^2)/(n-2)}}$. If we use this we get $t = \frac{.8783}{\sqrt{(1-.7714)/8}} = 5.196$. Our rejection region is below $-t^{8}_{.025} = -2.306$ and above $t^{8}_{.025} = 2.306$. Since our computed value of t falls in the reject region, we reject the null hypothesis.
b) The outline says that if we are testing $H_0: \rho = \rho_0$ against $H_1: \rho \ne \rho_0$, and $\rho_0 \ne 0$, the test is quite different. We need to use Fisher's z-transformation. Let $\tilde{z} = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right)$. This has an approximate mean of $\mu_z = \frac{1}{2}\ln\left(\frac{1+\rho_0}{1-\rho_0}\right)$ and a standard deviation of $s_z = \sqrt{\frac{1}{n-3}}$, so that $t = \frac{\tilde{z} - \mu_z}{s_z}$. We know $r = .8783$ and $\rho_0 = .85$. So


$\tilde{z} = 1.368$, $\mu_z = 1.256$ and $s_z = \sqrt{1/7} = .378$. Finally $t = \frac{1.368 - 1.256}{.378} = 0.297$. Our rejection region is below and above . Since our computed value of t does not fall in the reject region, we do not reject the null hypothesis.
c. Do ranks for the values of ‘size’ in the column, compute a rank correlation between price and size and test it for significance using the rank correlation table if possible. (5) The ranking for size appears above, and the ranking of price is obvious. We now have the following columns.

Row  rprice  rsize     d   d^2
1         1    1.0   0.0  0.00
2         2    3.0  -1.0  1.00
3         3    4.5  -1.5  2.25
4         4    2.0   2.0  4.00
5         5    6.5  -1.5  2.25
6         6    4.5   1.5  2.25
7         7    8.0  -1.0  1.00
8         8    6.5   1.5  2.25
9         9    9.0   0.0  0.00
10       10   10.0   0.0  0.00
Sum                  0.0 15.00

Since $\sum d^2 = 15$, $r_s = 1 - \frac{6\sum d^2}{n(n^2-1)} = 1 - \frac{6(15)}{10(99)} = 0.9091$. If we check the table ‘Critical Values of the Spearman Rank Correlation Coefficient,’ we find that the critical value for $n = 10$ and $\alpha = .05$ is .55150. Since 0.9091 is above the critical value, we must reject the null hypothesis of no rank correlation and we conclude that the rankings agree.
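The three correlation computations in this problem can be checked together. The sketch below recomputes r from the column sums, forms the t-ratio for a zero correlation, applies Fisher's z-transformation for the hypothesized value .85, and gets the Spearman coefficient from the ranks listed above. Variable names are illustrative; critical values still have to come from the tables.

```python
import math

n = 10
sum_x, sum_y = 1011.0, 7320.51
sum_x2, sum_xy, sum_y2 = 268051.0, 1187817.0, 6925785.0

ss_x = sum_x2 - sum_x ** 2 / n
ss_y = sum_y2 - sum_y ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

# a) Pearson correlation and the t-ratio for H0: rho = 0.
r = s_xy / math.sqrt(ss_x * ss_y)
t_zero = r / math.sqrt((1 - r ** 2) / (n - 2))

# b) Fisher's z-transformation for H0: rho = 0.85.
rho_0 = 0.85
z_r = 0.5 * math.log((1 + r) / (1 - r))
z_0 = 0.5 * math.log((1 + rho_0) / (1 - rho_0))
s_z = math.sqrt(1 / (n - 3))
t_fisher = (z_r - z_0) / s_z

# c) Spearman rank correlation from the ranks in the table above.
rank_price = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rank_size = [1, 3, 4.5, 2, 6.5, 4.5, 8, 6.5, 9, 10]
sum_d2 = sum((p - s) ** 2 for p, s in zip(rank_price, rank_size))
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(f"r = {r:.4f}, t for rho = 0: {t_zero:.3f}")
print(f"Fisher ratio for rho = .85: {t_fisher:.3f}")
print(f"sum d^2 = {sum_d2}, r_s = {r_s:.4f}")
```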


4. Explain the following.
a. Under what circumstances could you use a Chi squared method to test for Normality but not a Kolmogorov-Smirnov? (2)
b. Under what circumstances could you use a Lilliefors test to test for Normality but not a Kolmogorov-Smirnov? (2)
c. Under what circumstances could you use a Kruskal-Wallis test to test whether four distributions are similar but not a one-way ANOVA? (2)
d. What 2 tests can be used to test for the equality of two medians? Which is more powerful? (2)
f. A random sample of 21 Porsche drivers were asked how many miles they had driven in the last year and a frequency table was constructed of the data.

Miles           Observed frequency
0 – 4000        2
4000 – 8000     7
8000 – 12000    7
Over 12000      5

Does the data follow a Normal distribution with a mean of 8000 and a standard deviation of 2000? Do not cut the number of groups below what is presented here. Find the appropriate E or cumulative E and do the test. (6)

Solution:
a. When the parameters of the distribution must be computed from the data.
b. When the mean and standard deviation had to be computed from the data.
c. When the distributions are not approximately Normal.
d. This was misstated and almost any answer that showed you knew how to do a test for 2 medians was accepted.
f.

Miles           Values of z       O    F_o     F_e                   D
0 – 4000        -4.00 to -2.00    2   .0952   .5 - .4772 = .0228   .0724
4000 – 8000     -2.00 to 0        7   .4286   .5                   .0714
8000 – 12000    0 to 2.00         7   .7619   .9772                .2153
Over 12000      2.00 and up       5   1.000   1.0000               0
                                 21

This is a K-S test where $F_e$ is the cumulative probability under the Normal distribution. The tabulated probabilities are , , and . According to the K-S table, the 5% critical value for $n = 21$ is .287. Since the maximum discrepancy, .2153, does not exceed the critical value, do not reject $H_0$.
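The K-S columns can be reproduced with the standard library alone, since the Normal cumulative probability can be written with the error function. This is a minimal sketch using the observed frequencies and the hypothesized mean and standard deviation from the problem; the critical value .287 is the one quoted from the K-S table above.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative probability of a Normal(mean, sd) distribution at x."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

upper_limits = [4000, 8000, 12000, math.inf]   # upper edge of each class
observed = [2, 7, 7, 5]
mean, sd = 8000, 2000

n = sum(observed)

# Cumulative observed proportions F_o.
f_obs = []
running = 0
for count in observed:
    running += count
    f_obs.append(running / n)

# Cumulative expected proportions F_e under the hypothesized Normal.
f_exp = [normal_cdf(limit, mean, sd) for limit in upper_limits]

# Absolute discrepancies D and their maximum.
gaps = [abs(o - e) for o, e in zip(f_obs, f_exp)]
d_max = max(gaps)

CRITICAL = 0.287  # 5% K-S critical value for n = 21, as quoted above
print([round(g, 4) for g in gaps])
print("max D =", round(d_max, 4), "reject H0:", d_max > CRITICAL)
```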


5. (Ullman) A Latin Square is an extremely effective way of doing a 3-way ANOVA. In this example the data is arranged in 4 rows and 4 columns. There are 3 factors. Factor A is rows - machines. Factor B is columns - operators. Factor C - materials is shown by a tag C1, C2, C3 or C4. Each material appears once in each row and once in each column. These are times to do a job categorized by machines, operators and cutting material. The rules are just the same as in any ANOVA: degrees of freedom add up and sums of squares add up. I am going to set this up as a 2-way ANOVA with one measurement per cell. There is no interaction. We, of course, assume that the parent distribution is Normal.

          B1      B2      B3      B4    Sum    n   Mean    SS
A1      7 C1    4 C2    5 C3    3 C4     19    4   4.75    99
A2      6 C4    9 C1    4 C2    2 C3     21    4   5.25   137
A3      5 C3    1 C4    6 C1    1 C2     13    4   3.25    63
A4      6 C2    3 C3    4 C4   10 C1     23    4   5.75   161
Sum       24      17      19      16     76   16   (  )
n          4       4       4       4     16
Mean    6.00    4.25    4.75    4.00   (  )
SS       146     107      93     114

You now have a choice.

a) If you are a real wimp, you will pretend that each column is a random sample and compare the means of each operator. (5) The table will look like that below.

Source      SS   DF   MS   F
Between
Within
Total

b) If you are less wimpy, you will pretend that this is a 2-way ANOVA and your table will look like that below. (8) A choice means you don’t get full credit for doing more than one of these!

Source      SS   DF   MS   F
Rows A
Columns B
Within
Total

c) If you are very daring, you will try the table below. (11) To do this you need to know that the means for the 4 materials are 8, 3.75, 3.75 and 3.50 and that the factor C sum of squares is 56.5. I think that the degrees of freedom should be obvious. Please don’t make the same mistakes you made on the last exam! You have 3 null hypotheses. Tell me what they are and whether you reject them.

Source        SS   DF   MS   F
Rows A
Columns B
Materials C
Within
Total

d) Assuming that your data is cross classified, compare the means of columns 1 and 4 using a 2-sample method. (3)
e) Assume that this is the equivalent of a 2-way one-measurement per cell ANOVA, but that the underlying distribution is not Normal and do an appropriate rank test. (5)


Solution: Let's start by playing ‘fill in the blanks.’ I am using the same format as was given in the 2-way ANOVA example with one measurement per cell. No SS can be negative, ever. And saying that an SS is zero implies that the variable it describes is a constant! We, of course, assume that the parent distribution is Normal.

            B1       B2       B3      B4    Sum    n    Mean     SS   Mean^2
A1        7 C1     4 C2     5 C3    3 C4     19    4    4.75     99  22.5625
A2        6 C4     9 C1     4 C2    2 C3     21    4    5.25    137  27.5625
A3        5 C3     1 C4     6 C1    1 C2     13    4    3.25     63  10.5625
A4        6 C2     3 C3     4 C4   10 C1     23    4    5.75    161  33.0625
Sum         24       17       19      16     76   16  (4.75)    460  93.7500
n            4        4        4       4     16
Mean      6.00     4.25     4.75    4.00  (4.75)
SS         146      107       93     114    460
Mean^2      36  18.0625  22.5625      16  92.625

Since most of this has been done for you, the only mystery is , which should be found by dividing by In this particular case, because the row sizes and the column sizes are equal, the overall mean can also be found by averaging the row means or averaging the column means. There are rows, columns and materials. , and . All the formulas that we need to use end with .

So . If we use for rows, we can use .

If we use for columns, we can use . This is in a one way ANOVA. Finally, I gave you . Because there are 4 items in each of the sums of squares, the degrees of freedom for each is 3. The total degrees of freedom are If we remember that sums of squares and degrees of freedom must add up, that MS is SS divided by DF and that F is MS divided by the Within MS, we get the following tables. a) If you are a real wimp, you will pretend that each column is a random sample and compare the means of each operator. (5) the table will look like that below. Column means are equal. We do not reject this hypothesis because our computed F is less than the table F.

Source      SS     DF   MS      F
Between      9.5    3   3.167   0.425 ns
Within      89.5   12   7.458
Total       99.0   15
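All of the sums of squares used in tables a) through c) can be checked from the 16 times in the data table. The sketch below is one way to do that. The layout of the material tags is an assumption: it is the cyclic arrangement in which each material appears once in every row and column, chosen because it reproduces the stated material means of 8, 3.75, 3.75 and 3.50.

```python
# Times from the problem 5 table: rows are machines (A), columns operators (B).
times = [
    [7, 4, 5, 3],
    [6, 9, 4, 2],
    [5, 1, 6, 1],
    [6, 3, 4, 10],
]
# Assumed material (C) tag for each cell; see the note above.
materials = [
    [1, 2, 3, 4],
    [4, 1, 2, 3],
    [3, 4, 1, 2],
    [2, 3, 4, 1],
]

m = 4
cells = [(i, j) for i in range(m) for j in range(m)]
grand_mean = sum(times[i][j] for i, j in cells) / m ** 2

def between_ss(group_of):
    """Sum of squares between the groups that group_of assigns to the cells."""
    groups = {}
    for i, j in cells:
        groups.setdefault(group_of(i, j), []).append(times[i][j])
    return sum(len(v) * (sum(v) / len(v) - grand_mean) ** 2
               for v in groups.values())

sst = sum((times[i][j] - grand_mean) ** 2 for i, j in cells)
ss_rows = between_ss(lambda i, j: i)
ss_cols = between_ss(lambda i, j: j)
ss_mat = between_ss(lambda i, j: materials[i][j])

# Latin square: whatever the three factors do not explain is 'Within'.
ss_within = sst - ss_rows - ss_cols - ss_mat
df_factor = m - 1
df_within = (m - 1) * (m - 2)
ms_within = ss_within / df_within

f_ratios = {
    name: (ss / df_factor) / ms_within
    for name, ss in [("rows", ss_rows), ("columns", ss_cols),
                     ("materials", ss_mat)]
}

print("SST", sst, "rows", ss_rows, "columns", ss_cols, "materials", ss_mat)
print("within", ss_within, "df", df_within)
print({name: round(f, 3) for name, f in f_ratios.items()})
```

Dropping `ss_mat` from the subtraction gives the Within sum of squares for the 2-way table, and dropping `ss_rows` as well gives the one for the one-way table.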

b) If you are less wimpy, you will pretend that this is a 2-way ANOVA and your table will look like that below. (8)

$H_0$: Row means are equal. We do not reject this hypothesis because our computed F is less than the table F.

$H_0$: Column means are equal. We do not reject this hypothesis because our computed F is less than the table F.

Source      SS     DF   MS      F
Rows A      14.0    3   4.667   0.556 ns
Columns B    9.5    3   3.167   0.377 ns
Within      75.5    9   8.389
Total       99     15

c) If you are very daring, you will try the table below. (11) To do this you need to know that the means for the 4 materials are 8, 3.75, 3.75 and 3.50 and that the factor C sum of squares is 56.5. I think that the degrees of freedom should be obvious. You have 3 null hypotheses. Tell me what they are and whether you reject them.

$H_0$: Row means are equal. We do not reject this hypothesis because our computed F is less than the table F.

Column means are equal. We do not reject this hypothesis because our computed F is less than the table F.

Material means are equal. We reject this hypothesis because our computed F is larger than the table F.
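The decompositions behind parts b) and c) can be sketched as follows. The `materials` layout is an assumption: it is the Latin-square assignment that reproduces the stated material means of 8, 3.75, 3.75 and 3.50. The helper names `factor_ss` and `f_ratios` are hypothetical.

```python
# Data: rows A1..A4, columns B1..B4.
data = [
    [7, 4, 5, 3],
    [6, 9, 4, 2],
    [5, 1, 6, 1],
    [6, 3, 4, 10],
]
# Assumed Latin-square layout of materials C1..C4 (inferred from the
# material means 8, 3.75, 3.75, 3.50 quoted in part c).
materials = [
    [1, 2, 3, 4],
    [4, 1, 2, 3],
    [3, 4, 1, 2],
    [2, 3, 4, 1],
]


def mean(xs):
    return sum(xs) / len(xs)


values = [x for row in data for x in row]
correction = len(values) * mean(values) ** 2      # n * xbar^2 = 361
ss_total = sum(x * x for x in values) - correction


def factor_ss(groups):
    """Sum of squares for a factor whose levels are the given groups."""
    return sum(len(g) * mean(g) ** 2 for g in groups) - correction


rows = data
cols = [list(col) for col in zip(*data)]
mats = [
    [data[i][j] for i in range(4) for j in range(4) if materials[i][j] == m]
    for m in (1, 2, 3, 4)
]
ss_a, ss_b, ss_c = factor_ss(rows), factor_ss(cols), factor_ss(mats)


def f_ratios(factor_ss_list, df_factor=3):
    """Return (SS within, DF within, F ratios) for the listed factors."""
    ss_within = ss_total - sum(factor_ss_list)
    df_within = (len(values) - 1) - df_factor * len(factor_ss_list)
    ms_within = ss_within / df_within
    return ss_within, df_within, [
        (ss / df_factor) / ms_within for ss in factor_ss_list
    ]


two_way = f_ratios([ss_a, ss_b])            # part b: rows and columns
latin = f_ratios([ss_a, ss_b, ss_c])        # part c: rows, columns, materials
print("two-way:", two_way)
print("Latin square:", latin)
```

Each F ratio is a factor mean square divided by the Within mean square of the same table, so adding the materials factor shrinks the Within SS and changes every F.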

Source         SS     DF     MS        F
Rows A        14.0     3    4.6667    1.474 ns
Columns B      9.5     3    3.1667    1.000 ns
Materials C   56.5     3   18.8333    5.947 s
Within        19.0     6    3.1667
Total         99.0    15

d) Assuming that your data is cross classified, compare the means of columns 1 and 4 using a 2-sample method. (3)

Solution: This is an easy one.

From the document 252solnD2 we have the following. To do this problem we do not need statistics on $x_1$ or $x_4$, but only on the difference, $d = x_1 - x_4$, which is displayed below. You should be able to compute $\sum d$ and $\sum d^2$.

Row     x1     x4      d     d^2
A1       7      3      4      16
A2       6      2      4      16
A3       5      1      4      16
A4       6     10     -4      16
Sum     24     16      8      64

So we have $n = 4$, $\sum d = 8$ and $\sum d^2 = 64$, which gives $\bar{d} = 8/4 = 2$ and $s_d^2 = \frac{\sum d^2 - n\bar{d}^2}{n-1} = \frac{64 - 4(2)^2}{3} = 16$, so that $s_d = 4$. We need $s_{\bar{d}} = \frac{s_d}{\sqrt{n}} = \frac{4}{\sqrt{4}} = 2$ and $t_{.025}^{(3)} = 3.182$.

If the paired data problem were on the formula table, it would appear as below.

Interval for:          Difference between Two Means (paired data)
Confidence Interval:   $D = \bar{d} \pm t_{\alpha/2}\, s_{\bar{d}}$
Hypotheses:            $H_0: D = D_0$ *, $H_1: D \ne D_0$
Test Ratio:            $t = \frac{\bar{d} - D_0}{s_{\bar{d}}}$
Critical Value:        $\bar{d}_{cv} = D_0 \pm t_{\alpha/2}\, s_{\bar{d}}$

* Same as $H_0: \mu_1 = \mu_4$, $H_1: \mu_1 \ne \mu_4$ if $D_0 = 0$.

We can do one of the following.

(i) Confidence interval: $D = 2 \pm 3.182(2) = 2 \pm 6.364$, or -4.364 to 8.364. Because this interval includes zero, we do not reject our null hypothesis and cannot conclude that there is a significant difference between the means of the populations from which the two data columns come.

(ii) Test ratio: $t = \frac{2 - 0}{2} = 1.000$. Our reject zone is below -3.182 and above 3.182. Since the t-ratio falls between these values, we do not reject our null hypothesis and cannot conclude that there is a significant difference between the means of the populations from which the two data columns come.

(iii) Critical value: $\bar{d}_{cv} = 0 \pm 3.182(2) = \pm 6.364$. Since $\bar{d} = 2$ falls between these 2 limits, we do not reject our null hypothesis and cannot conclude that there is a significant difference between the means of the populations from which the two data columns come.

e) Assume that this is the equivalent of a 2-way one-measurement per cell ANOVA, but that the underlying distribution is not Normal and do an appropriate rank test. (5)

In general, if the parent distribution is Normal, use ANOVA; if it is not Normal, use Friedman or Kruskal-Wallis. If the samples are independent random samples, use 1-way ANOVA or Kruskal-Wallis. If they are cross-classified, use Friedman or 2-way ANOVA. So the other method that allows for cross-classification is Friedman, and we use it if the underlying distribution is not Normal.
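As a numerical cross-check on the paired comparison in part d), the sketch below recomputes the paired-difference statistics directly from columns 1 and 4 of the data table. It assumes the differences are taken as column 1 minus column 4 and uses the 3.182 critical value quoted in the text; the variable names are illustrative only.

```python
import math

# Columns B1 and B4 of the data table, paired by row A1..A4.
col1 = [7, 6, 5, 6]
col4 = [3, 2, 1, 10]

# Assumption: d = x1 - x4 for each row.
d = [a - b for a, b in zip(col1, col4)]
n = len(d)
d_bar = sum(d) / n
# Sample variance of the differences via the computational formula.
s_d = math.sqrt((sum(x * x for x in d) - n * d_bar ** 2) / (n - 1))
se = s_d / math.sqrt(n)                 # standard error of d_bar
t_ratio = (d_bar - 0) / se              # test ratio with D0 = 0

T_CRIT = 3.182                          # t(.025) with 3 df, from the text
ci = (d_bar - T_CRIT * se, d_bar + T_CRIT * se)

print("d =", d)
print("d_bar =", d_bar, " s_d =", s_d, " se =", se)
print("t =", t_ratio, " reject zone: |t| >", T_CRIT)
print("95% confidence interval:", ci)
```

The test ratio, the confidence interval and the critical-value method all rest on the same standard error, so they must lead to the same decision.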

The null hypothesis is $H_0$: Columns come from the same distribution or $H_0$: Column medians are equal. We use a Friedman test because the data is cross-classified by row. This time we rank our data only within rows. There are $c = 4$ columns and $r = 4$ rows.

Row     x1    x2    x3    x4       r1     r2     r3     r4
1        7     4     5     3        4      2      3      1
2        6     9     4     2        3      4      2      1
3        5     1     6     1        3     1.5     4     1.5
4        6     3     4    10        3      1      2      4
Sum                                13     8.5    11     7.5

To check the ranking, note that the sum of the four rank sums is 13 + 8.5 + 11 + 7.5 = 40, and that the sum of the rank sums should be $\frac{rc(c+1)}{2} = \frac{4(4)(5)}{2} = 40$.

Now compute the Friedman statistic $\chi_F^2 = \frac{12}{rc(c+1)}\sum SR_j^2 - 3r(c+1) = \frac{12}{4(4)(5)}\left(13^2 + 8.5^2 + 11^2 + 7.5^2\right) - 3(4)(5) = 0.15(418.5) - 60 = 2.775$. If we check the Friedman Table for $c = 4$ and $r = 4$, we find that the p-value is between .508 (for 2.7) and .432 (for 3). Since 2.775 lies between 2.7 and 3, the p-value must be above 5% and we do not reject the null hypothesis. Alternately, since the table says that 7.5 has a p-value of .052 and 7.8 has a p-value of .036, the 5% critical value must be slightly above 7.5. Since 2.775 is well below the critical value, do not reject the null hypothesis.
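The ranking and the Friedman statistic can be reproduced with a few lines of code. This is a minimal sketch: `average_ranks` is a hypothetical helper that ranks within a row and gives tied values the average of the ranks they occupy, which is how the 1.5 entries arise.

```python
# Data: rows A1..A4, columns B1..B4.
data = [
    [7, 4, 5, 3],
    [6, 9, 4, 2],
    [5, 1, 6, 1],
    [6, 3, 4, 10],
]


def average_ranks(row):
    """Rank one row from 1 (smallest), averaging the ranks of ties."""
    ordered = sorted(row)
    ranks = []
    for x in row:
        first = ordered.index(x) + 1      # lowest rank this value occupies
        count = ordered.count(x)          # number of tied values
        ranks.append(first + (count - 1) / 2)
    return ranks


ranks = [average_ranks(row) for row in data]
rank_sums = [sum(col) for col in zip(*ranks)]

r = len(data)        # number of rows (blocks)
c = len(data[0])     # number of columns (treatments)

# Check: the rank sums must add up to r * c * (c + 1) / 2.
expected_total = r * c * (c + 1) / 2

friedman = 12 / (r * c * (c + 1)) * sum(s * s for s in rank_sums) - 3 * r * (c + 1)

print("ranks:", ranks)
print("rank sums:", rank_sums, "total:", sum(rank_sums), "expected:", expected_total)
print("Friedman statistic:", round(friedman, 3))
```

The statistic is then compared with the Friedman table for 4 columns and 4 rows, as in the text.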


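6. a. A stock moves up and down as follows. In 36 days it goes up 14 times and down 22 times.

UDDDDUUUDUDDDUUDDDDUDDDUUDDDUDUUDDDU

(i) Test these movements for randomness. (5)
(ii) Take the first half of the series and test it for randomness (and don't repeat what you did in part (i) exactly). (4)

b. Explain, briefly, why I did not bother with a Durbin-Watson test in the regression that began the exam. (2)

c. Test the hypothesis that the population the D's and U's above came from is evenly split between D's and U's. (4)

Solution: a) (i) The runs are numbered below; the dash in run 8 marks the midpoint of the series.

U     DDDD  UUU   D     U     DDD   UU    DDD-D U     DDD   UU    DDD   U     D     UU    DDD   U
1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16    17

$n = 36$, $n_1 = 14$, $n_2 = 22$, $r = 17$. The outline says that for a larger problem (if $n_1$ and $n_2$ are too large for the table), $r$ follows the normal distribution with $\mu = \frac{2n_1n_2}{n} + 1$ and $\sigma^2 = \frac{(\mu - 1)(\mu - 2)}{n - 1}$. So $\mu = \frac{2(14)(22)}{36} + 1 = 18.111$, $\sigma^2 = \frac{(17.111)(16.111)}{35} = 7.877$, $\sigma = 2.807$ and $z = \frac{r - \mu}{\sigma} = \frac{17 - 18.111}{2.807} = -0.396$. Since this value of $z$ is between $\pm z_{.025} = \pm 1.960$, we do not reject $H_0$: Randomness.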
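The run count and the normal approximation in part (i) can be checked as follows. This is a sketch, with `runs_test` a hypothetical helper; it uses the mean and variance formulas $\mu = 2n_1n_2/n + 1$ and $\sigma^2 = (\mu-1)(\mu-2)/(n-1)$, which is algebraically the same as the usual large-sample runs-test variance.

```python
import math
from itertools import groupby

series = "UDDDDUUUDUDDDUUDDDDUDDDUUDDDUDUUDDDU"


def runs_test(seq):
    """Return n1 (U's), n2 (D's), the number of runs, mu, sigma and z."""
    n1 = seq.count("U")
    n2 = seq.count("D")
    n = n1 + n2
    # groupby collapses each stretch of identical symbols into one run.
    runs = sum(1 for _ in groupby(seq))
    mu = 2 * n1 * n2 / n + 1
    sigma = math.sqrt((mu - 1) * (mu - 2) / (n - 1))
    z = (runs - mu) / sigma
    return n1, n2, runs, mu, sigma, z


n1, n2, runs, mu, sigma, z = runs_test(series)
print(f"n1={n1} n2={n2} runs={runs} mu={mu:.3f} sigma={sigma:.3f} z={z:.3f}")
print("reject randomness at 5%:", abs(z) > 1.960)
```

The same helper also counts runs for any shorter stretch of the series, although for small $n_1$ and $n_2$ the runs-test table should be used instead of the normal approximation.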

(ii) For half the series, $n = 18$, $n_1 = 7$, $n_2 = 11$ and $r = 8$. The Runs Test table says that the critical values are 5 and 14. Since 8 is between these numbers, we do not reject the null hypothesis.

b. The Durbin-Watson is a test for serial correlation. It is useful in problems that have a time dimension, which this problem does not have.

c. We are testing $H_0: p = .5$ against $H_1: p \ne .5$, where $p$ is the proportion of D's in the population underlying the sample.

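We have $p_0 = .5$, $n = 36$ and $\bar{p} = \frac{22}{36} = .6111$. Our formula table has the following.

Interval for:          Proportion
Confidence Interval:   $p = \bar{p} \pm z_{\alpha/2}\, s_p$, where $s_p = \sqrt{\frac{\bar{p}\bar{q}}{n}}$
Hypotheses:            $H_0: p = p_0$, $H_1: p \ne p_0$
Test Ratio:            $z = \frac{\bar{p} - p_0}{\sigma_p}$
Critical Value:        $\bar{p}_{cv} = p_0 \pm z_{\alpha/2}\, \sigma_p$, where $\sigma_p = \sqrt{\frac{p_0 q_0}{n}}$

For the test ratio or critical value method, $\sigma_p = \sqrt{\frac{p_0 q_0}{n}} = \sqrt{\frac{.5(.5)}{36}} = .0833$.

Critical value method: $\bar{p}_{cv} = .5 \pm 1.960(.0833) = .5 \pm .163$, or .34 to .66. Since $\bar{p} = .6111$ falls between these limits, we cannot reject the null hypothesis.

Test ratio method: $z = \frac{\bar{p} - p_0}{\sigma_p} = \frac{.6111 - .5}{.0833} = 1.333$. Critical values for $z$ are $\pm 1.960$. Since our computed $z$ falls between these limits, we cannot reject the null hypothesis.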
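Both methods for the proportion test can be verified numerically. The sketch below assumes a two-sided test at the 5% level, which is what the limits .34 to .66 imply; the variable names are illustrative.

```python
import math

n = 36            # days observed
downs = 22        # number of D's in the series
p0 = 0.5          # hypothesized proportion of D's

p_bar = downs / n
# Standard error under the null hypothesis uses p0, not p_bar.
sigma_p = math.sqrt(p0 * (1 - p0) / n)

Z_CRIT = 1.960    # two-sided 5% critical value of z (assumed level)

# Critical value method: limits for p_bar around p0.
lower = p0 - Z_CRIT * sigma_p
upper = p0 + Z_CRIT * sigma_p

# Test ratio method.
z = (p_bar - p0) / sigma_p

print(f"p_bar = {p_bar:.4f}, sigma_p = {sigma_p:.4f}")
print(f"critical values for p_bar: {lower:.4f} to {upper:.4f}")
print(f"z = {z:.3f}, reject H0: {abs(z) > Z_CRIT}")
```

Because both methods use the same $\sigma_p$, $\bar{p}$ lies inside the critical limits exactly when $z$ lies inside $\pm 1.960$.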
