Analysing and Understanding Learning Assessment for Evidence-based Policy Making
Correlation and Regression, Bangkok, 14–18 September 2015
Australian Council for Educational Research

Correlation
• The strength of a mutual relation between 2 (or more) things
• You need to know 2 things about each unit of analysis
  – student (e.g. maths and reading performance)
  – school (e.g. funding level and mean reading performance)
  – country (e.g. mean performance in 2010 and in 2013)
• No assumption about the direction of the relationship
• Correlation is simply standardised covariance – i.e., covariance divided by the product of the standard deviations of the variables.

Formulas
• Variance: σ² = Σ(X − X̄)² / (N − 1)
• Standard deviation: σ = √[ Σ(X − X̄)² / (N − 1) ]
• Covariance: cov(x, y) = Σ(X − X̄)(Y − Ȳ) / (N − 1)
• Correlation (Pearson's r): r = cov(x, y) / (σₓ σᵧ)

A note on sample vs population estimators
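The formulas above can be sketched directly in Python. This is a minimal illustration with made-up maths/reading scores (the data values are hypothetical, not from the slides):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r: the covariance standardised by the two standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    sx = sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
    sy = sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))
    return cov / (sx * sy)

# Hypothetical per-student maths and reading scores
maths = [500, 520, 540, 580, 610]
reading = [480, 510, 525, 560, 600]
print(pearson_r(maths, reading))  # close to 1: a strong positive relationship
```

Note that the result is the same whichever variable comes first: correlation makes no assumption about direction.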
• Sample variance (dividing by N): σ² = Σ(X − X̄)² / N
• Sample covariance (dividing by N): cov(x, y) = Σ(X − X̄)(Y − Ȳ) / N
• An estimate of the variance based on a sample that divides by N is biased: it underestimates the true variance
• It needs a correction factor of N / (N − 1) – i.e., dividing by N − 1 instead of N – to produce an unbiased estimate

Type of correlation
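The N versus N − 1 point can be made concrete with a small helper (the `ddof` name mirrors the convention used by NumPy; the data values are made up):

```python
def var(data, ddof=0):
    """Variance with divisor N - ddof: ddof=0 is the biased 'divide by N'
    formula; ddof=1 applies the N/(N-1) correction (unbiased estimator)."""
    n = len(data)
    m = sum(data) / n
    return sum((x - m) ** 2 for x in data) / (n - ddof)

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(var(data))          # biased: divides by N, gives 4.0 here
print(var(data, ddof=1))  # unbiased: divides by N - 1, gives 32/7
```

The two results differ by exactly the factor N / (N − 1), which is the correction described above.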
• The correlation coefficient to use depends on the level of measurement of the variables
• Ordinal – ranks, Likert scales, ordered categories: Spearman correlation (ρ), Kendall's tau (τ)
• Interval/Ratio – metric scales, measures of magnitude: Pearson correlation (r)

Things to remember
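Spearman's ρ is simply Pearson's r computed on the ranks of the data, which is why it suits ordinal variables. A minimal sketch (assuming no tied values, which would need average ranks):

```python
from math import sqrt

def pearson(x, y):
    """Pearson's r (the normalising constants cancel in the ratio)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks of the values (no tie handling in this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman's rho = Pearson's r applied to the ranks."""
    return pearson(ranks(x), ranks(y))

# A monotonic but non-linear relationship: Spearman sees it as perfect,
# Pearson does not.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(spearman(x, y))  # essentially 1.0
print(pearson(x, y))   # below 1: the relationship is not linear
```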
• Independence – are the two values independent of each other?
• Linearity – is the relationship between the two values linear?
• Normality – are the two values distributed normally? (if not, a non-parametric correlation should be used)

Correlation values
0 = no relationship
1.0 = perfect positive relationship
−1.0 = perfect negative relationship
0.1 = weak relationship (if significant)
0.3 = moderate relationship (if significant)
0.5 = strong relationship (if significant)

Strong correlation
[Scatterplot: strong correlation, r = .80]

Perfect correlations

[Scatterplots: r = 1 and r = −1]

Moderate correlation

[Scatterplot: r = .36]

No correlation

[Scatterplot: r = .06]

Correlation vs Regression
• Correlation is not directional; the degree of association goes both ways.
• Correlation is not appropriate if the substantive meaning of X being associated with Y is different from Y being associated with X (for example, height and weight).
• Not appropriate when one of the variables is being manipulated, or being used to explain the other. Use regression instead.

Practical exercises
• Be careful about spurious correlations: just because two variables correlate highly does not mean there is a valid relationship between them.
• Correlation is not causation.
• With a large enough sample, anything can be significantly correlated with something.

Regression
• Also describes a relationship between 2 (or more) things, but assumes a direction
• Explains one variable with one (or more) other variable(s)
  – How well does SES predict performance?

Regression – cont.
• Two main statistics
  – Size of the effect, or slope
  – Strength of the effect, or explained variance

The General Idea
Simple regression considers the relation between a single explanatory variable and a response variable, summarised by the line of best fit (OLS).

Size of the effect

[Plot: line of best fit rising 50 units of Y per 1 unit of X, i.e. slope = 50]

Size of the effect – cont.

[Plot: line of best fit rising 25 units of Y per 1 unit of X, i.e. slope = 25]

The R2
The proportion of the total sample variance that is not explained by the regression is:

Residual sum of squares / Total sum of squares

Therefore, the proportion of the variance in the dependent variable that is explained by the independent variable (R2) is:

R2 = 1 − (Residual sum of squares / Total sum of squares)

Strength of the effect
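The definition translates directly into code. A minimal sketch, with made-up observed and fitted values:

```python
def r_squared(y, y_hat):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical observed values and regression predictions
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
print(r_squared(y, y_hat))  # most of the variance is explained
```

With the slide's numbers the same arithmetic gives R2 = 1 − (162.5 / 1250) = 0.87.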
For example, if the residual variance is a small proportion of the total variance
R2 = 1 − (162.5 / 1250) = 0.87

87% of the variation in reading is explained by ESCS

Strength – cont.
For example, if the residual variance is a large proportion of the total variance
R2 = 1 − (1075 / 1250) = 0.14

Only 14% of the variation in reading is explained by ESCS

Multiple Regression

Multiple regression simultaneously considers the influence of multiple explanatory variables on a response variable Y.
The intent is to look at the independent effect of each variable while “adjusting out” the influence of potential confounders
Source: Gertsman, B. (2008). Basic biostatistics: Statistics for public health practice. Sudbury, MA: Jones and Bartlett Publishers.

Regression Modeling
• A simple regression model (one independent variable) fits a regression line in 2-dimensional space
• A multiple regression model with two explanatory variables fits a regression plane in 3-dimensional space
• This concept can be extended indefinitely but visualisation is no longer possible for >3 variables.
Multiple Regression Model

Again, estimates for the multiple slope coefficients are derived by minimising the sum of squared residuals, Σ(residuals²), to fit the multiple regression model.

Again, the standard error of the regression is based on the residuals across all n observations:

s = √[ Σ(residuals²) / (n − k − 1) ]
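The standard error of the regression can be computed directly from the residuals. A sketch, assuming the fitted values come from a model with k explanatory variables (data here is invented for illustration):

```python
from math import sqrt

def se_estimate(y, y_hat, k):
    """Standard error of the regression: sqrt of the residual sum of
    squares divided by n - k - 1, where k = number of explanatory variables."""
    n = len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return sqrt(ss_res / (n - k - 1))

# Hypothetical observed values and predictions from a 2-predictor model
y     = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_hat = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5]
print(se_estimate(y, y_hat, k=2))  # every residual is +/-0.5 here
```

Note the denominator uses n − k − 1 degrees of freedom, not n, for the same bias reason discussed for the sample variance.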
Multiple Regression Model
• The intercept α predicts where the regression plane crosses the Y axis
• The slope for variable X1 (β1) predicts the change in Y per unit of X1, holding X2 constant
• The slope for variable X2 (β2) predicts the change in Y per unit of X2, holding X1 constant
Main purpose of regression analysis
• Prediction – developing a prediction model based on a set of predictor/independent variables. This purpose also allows for the evaluation of the predictive power of different models, as well as of different sets of predictors within a model.
• Explanation – validating or confirming an existing prediction model using new data. This purpose also allows for the assessment of the relationship between predictor and outcome variables.

Regression works provided assumptions are met
• Linearity
  – Check using partial regression plots (PLOTS: Produce all partial plots)
• Uniform variance (homoscedasticity)
  – Check by plotting residuals against the predicted values (PLOTS: Y: ZRESID, X: ZPRED)
  – For ANOVA, check using Levene's test for homogeneity of variance (EXPLORE, PLOTS: Spread vs Level)
• Independence of error terms
  – Check by plotting residuals against a sequencing variable
• Normality of the residuals
  – Check using normal P-P plots of the residuals (PLOTS: Normal probability plot)

Sample size
• Thorough method: a priori power analysis
  – Compute sample sizes for given effect sizes, alpha levels, and power values (G*Power 3: http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/)
• Fast method (but less thorough): rules of thumb, where k is the number of predictors
  – For R2 significance testing: 50 + 8k
  – For b-value significance testing: 104 + k
  – For both, use the larger number

Multicollinearity

y = b0 + b1x1
y = b0 + b1x1 + b2x2

but if x2 = x1 + 3:

y = b0 + b1x1 + b2(x1 + 3)
y = b0 + (b1 + b2)x1 + 3b2

so the separate contributions of x1 and x2 can no longer be estimated.
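The rules of thumb above can be wrapped in a tiny helper (the function name is mine, not from the slides):

```python
def min_sample_size(k):
    """Rule-of-thumb minimum n for a regression with k predictors:
    50 + 8k for testing R^2, 104 + k for testing individual b-values;
    take the larger of the two, as the slides recommend."""
    return max(50 + 8 * k, 104 + k)

print(min_sample_size(3))   # max(74, 107) = 107
print(min_sample_size(10))  # max(130, 114) = 130
```

Notice that the b-value rule dominates for small k, while the R2 rule dominates once k is large.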
Checking for multicollinearity

• For overall multicollinearity: VIF > 10; Tolerance < 0.10.
• For individual variables: identify a Condition Index > 15, then check the Variance Proportions of each coefficient (> .90).

Influential values

• Influential values are outliers that have a substantial effect on the regression line.
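The VIF for a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on the other predictors. For a two-predictor model this reduces to a simple regression, which makes for a compact sketch (data invented; in SPSS or statsmodels you would read the VIF off directly):

```python
def r2_simple(x, y):
    """R^2 of a simple regression of y on x (equals Pearson's r squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov * cov / (ssx * ssy)

def vif_two_predictors(x1, x2):
    """VIF for either predictor in a two-predictor model: 1 / (1 - R^2)."""
    return 1 / (1 - r2_simple(x1, x2))

x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]   # nearly 2*x1, so highly collinear
print(vif_two_predictors(x1, x2)) # far above the VIF > 10 danger threshold
```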
Source: Field, A. (2005). Discovering statistics using SPSS (2nd ed.). London: Sage.

When does linear regression modelling become inappropriate?
• When the dependent variable is dichotomous or polytomous (use logistic regression).
• When data are sequential over time and variables are 'autocorrelated' (use time series analysis).
• When context effects need to be analysed and slopes differ across higher-level units (use multi-level analysis).

Application: Illustrative Example
Childhood respiratory health survey.

• Binary explanatory variable (SMOKE) is coded 0 for non-smoker and 1 for smoker
• Response variable Forced Expiratory Volume (FEV) is measured in liters/second (lung capacity)
• Regressing FEV on SMOKE gives the least squares regression line: ŷ = 2.566 + 0.711x
• The mean FEV in non-smokers is 2.566
• The mean FEV in smokers is 3.277
Example, cont.
ŷ = 2.566 + 0.711x

• Intercept (2.566) = the mean FEV of group 0
• Slope = the mean difference in FEV (because x is 0/1): 3.277 − 2.566 = 0.711
• tstat = 6.464 with 652 df, p < .01 (b1 is significant)
• The 95% CI for the slope is 0.495 to 0.927
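The fact that a 0/1 predictor turns the regression into a comparison of group means can be checked numerically. A sketch with invented FEV-like data (not the survey data from the slides):

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((a - mx) * (c - my) for a, c in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
    return my - b * mx, b

# Hypothetical data: x is 0/1 smoking status, y is FEV-like
x = [0, 0, 0, 1, 1]
y = [2.4, 2.6, 2.7, 3.2, 3.4]
a, b = simple_ols(x, y)

mean0 = sum(yi for xi, yi in zip(x, y) if xi == 0) / 3
mean1 = sum(yi for xi, yi in zip(x, y) if xi == 1) / 2
# With a 0/1 predictor: intercept = group-0 mean, slope = difference of means
print(a, mean0)
print(b, mean1 - mean0)
```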
Smoking increases lung capacity?

• Children who smoked had higher mean FEV
• How can this be true, given what we know about the deleterious respiratory effects of smoking?
• ANS: the smokers were older than the non-smokers
• AGE confounded the relationship between SMOKE and FEV
• A multiple regression model can be used to adjust for AGE in this situation
Extending the analysis: Multiple regression
SPSS output for our example provides the intercept a and the slopes b1 and b2, so the multiple regression model is:

FEV = 0.367 − 0.209(SMOKE) + 0.231(AGE)
Multiple Regression Coefficients, cont.
• The slope coefficient for SMOKE is −0.209, suggesting that smokers have 0.209 less FEV on average than non-smokers (after adjusting for age)
• The slope coefficient for AGE is 0.231, suggesting that each year of age is associated with an increase of 0.231 FEV units on average (after adjusting for SMOKE)
Inference About the Coefficients
Inferential statistics are calculated for each regression coefficient. For example, in testing
H0: β1 = 0 (SMOKE coefficient controlling for AGE)
tstat = −2.588 and P = 0.010

Coefficients(a)

              Unstandardized          Standardized
Model         B       Std. Error      Beta       t        Sig.
(Constant)    .367    .081                       4.511    .000
smoke         −.209   .081            −.072      −2.588   .010
age           .231    .008            .786       28.176   .000
a. Dependent Variable: fev

df = n − k − 1 = 654 − 2 − 1 = 651
Inference About the Coefficients

The 95% confidence interval for the slope of SMOKE, controlling for AGE, is −0.368 to −0.050.
Coefficients(a)

              95% Confidence Interval for B
Model         Lower Bound     Upper Bound
(Constant)    .207            .527
smoke         −.368           −.050
age           .215            .247
a. Dependent Variable: fev
Assessing the significance of the model
• R Square (R2) – represents the proportion of variance in the outcome variable that is accounted for by the predictors in the model. For example, if for our previous model R2 = .23, then 23% of the variance in FEV is accounted for by smoking status and age.
• Adjusted R2 – compensates for the inflation of R2 due to overfitting. Useful for comparing the amount of variance explained across several models.
• Standard error of the estimate – a measure of the accuracy of the predictions. For example, if the SE of the estimate = 0.35 for our previous model, FEV = 0.367 − 0.209(SMOKE) + 0.231(AGE), then the predicted FEV for a non-smoker aged 12 years is FEV = 3.139 ± (t × 0.35).

Assessing the significance of the model
Hierarchical models
Suppose:
Model 1: FEV = 0.367 − 0.209(SMOKE) + 0.231(AGE), R2 = .23
Model 2: FEV = 0.367 − 0.209(SMOKE) + 0.231(AGE) + .04(GENDER), R2 = .29
What is the amount of unique variance explained by gender above and beyond that explained by smoking status and age?
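The unique variance is the R2 change, here .29 − .23 = .06 (6%), and its significance can be tested with the standard F test for an R2 change. A sketch using the slide's R2 values and the FEV example's n = 654 (the function name is mine; k_full counts the predictors in the larger model):

```python
def r2_change_F(r2_full, r2_reduced, n, k_full, m):
    """F statistic for the change in R^2 when m extra predictors are added;
    k_full is the number of predictors in the full model,
    with df = (m, n - k_full - 1)."""
    numerator = (r2_full - r2_reduced) / m
    denominator = (1 - r2_full) / (n - k_full - 1)
    return numerator / denominator

# Model 1 (SMOKE, AGE): R^2 = .23; Model 2 (+ GENDER): R^2 = .29; n = 654
F = r2_change_F(0.29, 0.23, n=654, k_full=3, m=1)
print(F)  # a large F: gender adds significant unique variance
```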
[Path diagrams: Model 1 with SMOKE and AGE predicting FEV; Model 2 adding GENDER]

Hierarchical regression in SPSS

Dummy Variables

More than two levels: for categorical variables with k categories, use k − 1 dummy variables.
Ex. SMOKE2 has three levels, initially coded:
0 = non-smoker
1 = former smoker
2 = current smoker

Use k − 1 = 3 − 1 = 2 dummy variables to code this information, e.g. (with non-smoker as the reference category):

SMOKE2              DUMMY1   DUMMY2
non-smoker (0)      0        0
former smoker (1)   1        0
current smoker (2)  0        1
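The k − 1 coding scheme can be sketched as a small function (names and data are illustrative; the first level listed is treated as the reference category):

```python
def dummy_code(values, levels):
    """Return k-1 dummy variables per observation, with the first
    level in `levels` as the reference category (coded all zeros)."""
    reference, *rest = levels
    return [[1 if v == lev else 0 for lev in rest] for v in values]

# 0 = non-smoker, 1 = former smoker, 2 = current smoker
smoke2 = [0, 1, 2, 2, 0]
print(dummy_code(smoke2, [0, 1, 2]))
# each row is [former?, current?]; non-smokers are [0, 0] (reference)
```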
Use of standardised coefficients
• Often thought to be 'easier' to interpret.
• Standardisation depends on the variances of the independent variables.
• Unstandardised coefficients can be translated directly into the units of the variables.
• Unstandardised coefficients cannot always be compared if different units are used for the variables.

Finding the best regression model
• The set of predictors must be chosen based on theory.
• Avoid the "whatever sticks to the wall" approach.
• The grouping of predictors and the ordering of entry will matter.
• Selecting the "best" final model can sometimes be a judgment call.

How to judge whether a model is good?
• Explained variance proportion, as measured by R2
• Size of the regression coefficients
• Significance tests (F-test for the model, t-tests for the parameters)
• Inclusion of all relevant variables (theory!)
• Is the method appropriate?

The six steps to interpreting results
1. Look at the prediction equation to see an estimate of the relationship.
2. Refer to the standard error of the estimate (in the appropriate model) when making predictions for individuals.
3. Refer to the standard errors of the coefficients (in the most complete model) to see how much you can trust the estimates of the effects of the explanatory variables.
4. Look at the significance levels of the t-ratios to see how strong the evidence is in support of including each explanatory variable in the model.
5. Use the coefficient of determination (R2) to measure the potential explanatory power of the model.
6. Compare the beta weights of the explanatory variables in order to rank them by explanatory importance.

Notes on interpreting the results
• Prediction is NOT causation.
• Inferring causation requires at least temporal precedence, but temporal precedence alone is still not sufficient.
• Avoid extrapolating the prediction equation beyond the data range.
• Always consider the standard errors and the confidence intervals of the parameter estimates.
• Whether the magnitude of the coefficient of determination (R2) represents adequate explanatory power is a judgment call.

Practice exercises!
Study: Mathematics Beliefs and Achievement of Elementary School Students in Japan and the United States: Results From the Third International Mathematics and Science Study (TIMSS). House, J. D., 2006
• Interpret the parameter estimates • Interpret the statistical significance of the predictors • Make substantive interpretation about the findings Extensions: Regression
Multiple regression considers the relation between a set of explanatory variables and a response or outcome variable.

[Diagram: independent predictors x1 and x2 each pointing to outcome y]

Moderating effect
Moderated regression: when the independent variable does not affect the outcome directly but rather affects the relationship between the predictor and the outcome.

[Diagram: independent variable x2 pointing at the path from predictor x1 to outcome y]

Simple moderating effect: when a categorical independent variable affects the relationship between the predictor and the outcome.

[Plot: separate regression lines of Y on X for categories C1, C2, C3]

Moderating effects

[Plots: categorical moderator vs continuous moderator; y = actual scaled score on the Multidimensional Perfectionism Scale (Hewitt & Flett)]

Types of moderators (Sharma et al., 1981)
                                Related to predictor       Not related to predictor
                                and/or outcome             and/or outcome
No interaction with predictor   Independent predictor      Homologizer
Interaction with predictor      Quasi-moderator            Pure moderator

Homologizer variables affect the strength (rather than the form) of the relationship between predictor and outcome (Zedeck, 1971).
• Moderation effects are also known as interaction effects.
• Interaction terms are product terms of the moderator and the relevant predictor (the variable the moderator interacts with):
  – Y = b0 + b1x1 + b2x2 + b3m
  – Interaction term: i1 = x1*m
• Choosing the moderator and the relevant predictor must have theoretical support. For example, it is possible that the moderator interacts with x2 instead (i.e., i1 = x2*m).
• Testing for the interaction effect necessitates including the interaction term(s) in the regression equation:
  – Y = b0 + b1x1 + b2x2 + b3m + b4i1
  – and testing H0: b4 = 0
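Building the interaction term is just an element-wise product appended to the design matrix. A minimal sketch (the function name is mine; in practice the centred variables are often multiplied to reduce collinearity with the main effects):

```python
def with_interaction(x1, x2, m):
    """Design rows [x1, x2, m, x1*m]: the product term x1*m is the
    interaction term i1 whose coefficient b4 is tested against zero."""
    return [[a, b, c, a * c] for a, b, c in zip(x1, x2, m)]

rows = with_interaction([1, 2], [5, 6], [0, 1])
print(rows)  # [[1, 5, 0, 0], [2, 6, 1, 2]]
```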
Mediated regression: when the independent predictor does not affect the outcome directly but affects it through an intermediary variable (the mediator).

[Diagram: independent predictor x1 pointing to intermediary variable x2 (the mediator), which points to outcome y]

Mediation vs Moderation
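A common way to quantify mediation is the product-of-coefficients (indirect effect) idea: the slope of the mediator on the predictor times the slope of the outcome on the mediator. This sketch simplifies by not controlling for x in the second regression, which a full mediation analysis would do; the data is invented:

```python
def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

# Hypothetical chain x -> m -> y, built so m tracks 2*x and y tracks m
x = [1, 2, 3, 4, 5]
m = [2.0, 4.1, 5.9, 8.2, 10.0]
y = [3.1, 5.0, 7.1, 8.9, 11.0]

a = slope(x, m)     # path from predictor to mediator (roughly 2)
b = slope(m, y)     # path from mediator to outcome (roughly 1; simplified)
indirect = a * b    # product-of-coefficients estimate of the indirect effect
print(indirect)
```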
Mediators explain why or how an independent variable X causes the outcome Y while a moderator variable affects the magnitude and direction of the relationship between X and Y (Saunders, 1956). These two approaches can be combined for more complex analyses:
• Moderated mediation
• Mediated moderation

Checklists

• Moderation
  – Collinearity between predictor and moderator (especially true for quasi-moderators).
  – Unequal variances between groups based on the moderator.
  – Reliability of measures (measurement errors are magnified when creating the product terms).
• Mediation
  – Theoretical assumptions about the mediator.
  – Rationale for selecting the mediator.
  – Significance and type (full/partial) of the mediation effect.
  – Implied causation (i.e., directional paths).