STAT 3900/4950 Homework/Lab 5 Spring, 2014

STAT 3900/4950 Homework/Lab 5 Spring, 2014 Thanks to Sarah Davis Question #1 (a) SAS output

SPSS output Correlations

Overall quality Number of as an Overall course Number of times cited by instructor quality publications other authors Overall quality as an Pearson Correlation 1 .638(**) .232 .312(*) instructor Sig. (2-tailed) . .000 .105 .028 N 50 50 50 50 Overall course quality Pearson Correlation .638(**) 1 .347(*) .372(**) Sig. (2-tailed) .000 . .014 .008 N 50 50 50 50 Number of publications Pearson Correlation .232 .347(*) 1 .828(**) Sig. (2-tailed) .105 .014 . .000 N 50 50 50 50 Number of times cited Pearson Correlation .312(*) .372(**) .828(**) 1 by other authors Sig. (2-tailed) .028 .008 .000 . N 50 50 50 50 ** Correlation is significant at the 0.01 level (2-tailed). * Correlation is significant at the 0.05 level (2-tailed). Table #1

a. The p-value for the correlation between rating 1 and rating 2 is .000. b. The correlation between the number of citations and the number of publications is .83. c. The correlation between rating 1 and the number of citations is .31.

(b). The scatterplot, Figure 1 shows that the relationship between the number of articles published and the overall quality of the instructor seems to be linear but with a weak strength.

1 50 s

n 40 o i t a c i l 30 b u p

20 r e b m

u 10 N

2.50 3.00 3.50 4.00 4.50 5.00 Overall quality as an instructor Figure #1 (c). The SAS (or SPSS) output gives us:

The regression line of the overall quality of the instructor (rating_1) on the number of articles published (num_pubs) is (average) rating_1 = 4.065 + 0.012 num_pubs. Interpret the slope: The overall quality of the instructor will increase 0.012 on average for every additional article he/she published. (d). (STAT 3900 only) . . .

l l a r e v O . . .

r e b m u N . . .

r e b m u N Overall quality as an instructorNumber of publications Overall course Numberquality of times cited by other authors

2 (e). Correlation coefficients were computed among four variables. These variables were overall instructor quality, overall course quality, number of publications, and number of citations. The results of the correlational analyses presented in Table 1 show that 5 out of the 6 correlations were statistically significant (at significance level of 5%) and were greater than or equal to 0.35. The correlation between overall instructor quality and number of publications was low and not significant, which indicates that the large number of publications of an instructor does not imply he/she is also a great teacher. In addition, the results show that the relatively stronger relationship (r > 0.6) were between overall instructor quality (0.64) and overall course quality and between number of publications and number of citations (0.83). That is, the rating for course tends to be higher when the rating for the instructor gets higher. Also the more publications implies more citations.

Question #2

(a). SAS output

SPSS output : Correlation Matrix

X Z Y X Pearson 1 -.975(**) .965(**) Correlation Sig. (2-tailed) . .005 .008 N 5 5 5 Z Pearson -.975(**) 1 -.963(**) Correlation Sig. (2-tailed) .005 . .008 N 5 5 5 Y Pearson .965(**) -.963(**) 1 Correlation Sig. (2-tailed) .008 .008 . N 5 5 5 ** Correlation is significant at the 0.01 level (2-tailed).

The correlation between X and Y is .965. It is significant as its p-value is .008 less than 0.05, so X and Y are linearly associated. The correlation between X and Z is -.975. It is significant as its p-value is .005 less than 0.05, so X and Z are linearly associated. Both pairs are significantly linearly associated.

3 (b).

(c). SAS output:

SPSS output:

Coefficientsa

Unstandardized Standardized Coefficients Coefficients Model B Std. Error Beta t Sig. 1 (Constant) .789 1.259 .627 .575 X 1.524 .239 .965 6.382 .008 a. Dependent Variable: Y The regression line of Y on X is Y=.789 + 1.524X. The slope is 1.524 and the intercept is .789. The intercept is not significantly different from zero (p-value is 0.575) but the slope is significantly different from zero (p-value is 0.008). (d). The linear regression line does fit the data well because adjusted R-square is large (0.909 >> 0.75).

Adjusted R Std. Error of Model R R Square Square the Estimate 1 .965(a) .931 .909 1.37607 a Predictors: (Constant), X

4 Assumption checking: SAS plots:

Checking normality: Figure 1

The points in the normal probability plot of standardized residuals (Figure 1) are more or less around a line, and so it is reasonable to assume normality of residuals. A normality test for residuals has the same conclusion.

Checking equal variances: Figure 2

5 The residual plot (Figure 2) shows no pattern. Therefore, it is reasonable to assume equal variances of residuals at different X values.

(e). Fit the model: LY  b0  b1 X  error Table 2:

From Table 2, the fitted model is: the fitted LY  0.89  0.22X . Both the intercept and the slope coefficient are significantly different from zero, with p-values of 0.0135 and 0.0065, respectively.

The linear regression line does fit the data well because adjusted R-square is large (0.919 >> 0.75). SAS output:

Assumption Checking: SAS plots:

From Left to Right: Figures 2.5, 2.6, 2.7  Equal Variance: Figure 2.6 shows that no standardized residual is greater than ±3 (no outliers). Due to the small sample size, we cannot really judge whether or not the variances are equal. But we can observe a non-linear pattern from this residual plot, which indicates the lack of fit of our model.  Normality: Figure 2.7 shows a normality plot. The residuals fall along the line nicely. There is no indication of non-normality. However, with 5 observations we are unlikely to reject normality, even if it is violated.

6 (f). Compare their Adjusted R-square values: From model (d) Model Summary

Adjusted R Std. Error of Model R R Square Square the Estimate 1 .965(a) .931 .909 1.37607 a Predictors: (Constant), X

From model (e)

R-Sq = 93.9% R-Sq(adj) = 91.9%

I would use model (d) although the adjusted R-square of model (e) is slightly larger because using the original scale of response variable (Y here) is preferred and the residual plot of model (e) shows a bit curved.

SAS Code: Question 1

DATA QUALITY; INPUT RATING_1 RATING_2 NUM_PUBS CITES; DATALINES; 4.87 2.78 0 0 4.47 4.85 14 90 4.86 3.86 1 2 3.27 3.75 1 32 4.80 4.37 11 53 3.94 4.37 3 12 4.61 3.61 16 108 4.32 4.77 1 4 4.46 4.34 13 82 4.64 4.34 1 19 3.20 3.03 0 0 4.30 4.44 7 51 4.98 4.58 49 274 2.98 2.52 3 8 4.50 4.20 1 17 4.53 4.15 1 8 3.55 2.76 0 0 4.00 4.45 2 28 3.82 4.23 0 0 4.13 3.91 1 27 4.97 4.83 37 189 4.12 3.82 0 0 4.82 4.77 19 102 3.12 2.89 0 0

7 4.57 4.21 1 13 3.00 2.86 10 63 4.89 3.89 3 18 4.14 4.39 8 137 4.23 4.68 14 54 4.08 4.32 5 31 4.33 3.89 0 0 3.06 3.30 17 98 4.12 3.62 1 5 4.30 4.17 25 115 4.50 4.23 1 21 4.33 3.95 3 35 4.40 4.13 11 118 4.33 4.10 5 49 4.26 3.78 0 0 3.76 4.21 8 27 4.34 4.22 12 64 4.83 3.83 10 22 3.02 2.00 0 0 4.88 3.88 3 154 4.31 4.46 34 64 4.07 3.75 40 137 2.96 2.82 1 23 4.88 3.88 1 16 3.36 3.76 15 37 4.14 4.60 11 69 ; RUN;

**PART (a) CORRELATION ANALYSIS; PROC CORR DATA = QUALITY; TITLE "CORRELATION MATRIX"; VAR RATING_1 RATING_2 NUM_PUBS CITES; RUN;

**PART (b) SCATTER PLOT; SYMBOL COLOR = RED VALUE = DOT;

PROC GPLOT DATA = QUALITY; TITLE "SCATTERPLOT OF INSTRUCTOR QUALITY AND ARTICLES PUBLISHED"; PLOT RATING_1*NUM_PUBS; RUN;

**PART (c) REGRESSION LINE OF RATING_1 (Y) ON NUM_PUBS (X); PROC REG DATA = QUALITY;

8 TITLE "REGRESSION: INSTRUCTOR QUALITY ON ARTICLES PUBLISHED"; MODEL RATING_1 = NUM_PUBS; RUN;

Question 2:

DATA Q2; INPUT X Y Z; LY = LOG(Y); DATALINES; 1 3 15 7 13 7 8 12 5 3 4 14 4 7 10 ; RUN;

PROC PRINT DATA = Q2; RUN;

**PART (a) Create a correlation matriX. **Identify the Pearson correlation coefficients between X and Y, X and Z. **Which pair(s) is significantly linearly associated? Identify the p-values to support your answer; PROC CORR DATA = Q2; VAR X Y Z; RUN;

**PART (b) SCATTERPLOT Y vs X WITH A REG LINE; **PART (c) COMPUTE OF REGRESSION LINE OF Y = X; PROC REG DATA = Q2; TITLE "SCATTERPLOT OF Y AND X WITH REGRESSION LINE"; MODEL Y = X; PLOT Y*X; RUN;

**PART (d) DOES THE REGRESSION LINE FIT WELL? (R^2 > 0.75); **CHECK ASSUMPTIONS;

**PART (e) CREATE LOG(Y). REPEAT (c) AND (d); PROC REG DATA = Q2; TITLE "SCATTERPLOT OF LOG(Y) AND X WITH REGRESSION LINE"; MODEL LY = X; PLOT LY*X; RUN;