Approaches to Analysing Politics: Correlation & Regression

Johan A. Elkink
School of Politics & International Relations
University College Dublin

10–12 April 2017

Outline

1 Covariance
2 Correlation
3 Simple linear regression
4 Multiple regression
5 Model fit
6 Assumptions

Causation

For something to be a causal relationship:

• the cause should precede the effect in time;
• the dependent and independent variables should correlate;
• no third variable should explain this correlation;
• there should be a logical explanation or causal mechanism linking the two.

Covariance

As a measure of the variation in a single variable, we typically use the variance, which is the average squared distance to the mean.

Two variables can also covary, or correlate. We can use the covariance as a measure:

$$\mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{N} (y_i - \bar{y}) \cdot (x_i - \bar{x})}{N}$$

• Positive covariance: when x is high, y is high, and vice versa.
• Negative covariance: when x is high, y is low, and vice versa.
• Covariance near zero: x and y are uncorrelated.
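As a concrete illustration, the covariance can be computed from this formula directly or with R's built-in cov() function. A minimal sketch on invented simulated data; note that cov() divides by N − 1 (the sample covariance), where the formula above divides by N:

```r
set.seed(42)                 # reproducible simulated data (hypothetical example)
x <- rnorm(20)               # 20 draws of an independent variable
y <- 1.5 * x + rnorm(20)     # y depends positively on x, plus noise

# Covariance by hand, dividing by N - 1 to match R's cov()
N <- length(x)
sum((y - mean(y)) * (x - mean(x))) / (N - 1)
cov(x, y)                    # built-in function, same value
```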

No correlation

[Figure: scatter plot of uncorrelated x and y; Cov(x, y) = −1.24]

Positive covariance

[Figure: scatter plot with an upward-sloping cloud of points; Cov(x, y) = 2.38]

Negative covariance

[Figure: scatter plot with a downward-sloping cloud of points; Cov(x, y) = −2.33]

Correlation

Just like with the variance, it is difficult to know what counts as a high or a low covariance, because the scale depends on the variation in the underlying x and y variables.

To correct for this, we can divide the covariance by the standard deviations:

$$\mathrm{Cor}(x, y) = \frac{\mathrm{Cov}(x, y)}{s_x \cdot s_y}$$

This is called the Pearson correlation coefficient. While correlation is a general concept and many measures of correlation exist, most people who refer to correlation mean this measure.

Pearson correlation

$$\mathrm{Cor}(x, y) = \frac{\mathrm{Cov}(x, y)}{s_x \cdot s_y}$$

• Bounded between −1 and 1.
• A negative value implies negative correlation and negative covariance.
• A positive value implies positive correlation and positive covariance.
• A value close to zero implies no correlation.
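Continuing the simulated example from the covariance sketch above, the Pearson coefficient is just the covariance rescaled by the two standard deviations:

```r
# Pearson correlation: covariance divided by both standard deviations
cov(x, y) / (sd(x) * sd(y))
cor(x, y)                    # built-in equivalent, same value
```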

No correlation

[Figure: scatter plot; Cov(x, y) = −1.24, Cor(x, y) = −0.50]

Positive correlation

[Figure: scatter plot; Cov(x, y) = 2.38, Cor(x, y) = 0.89]

Negative correlation

[Figure: scatter plot; Cov(x, y) = −2.33, Cor(x, y) = −0.91]

Linear regression

The aim of (simple) linear regression is to estimate a line that best summarizes the relationship between two variables.

The regression equation here is

$$y_i = b_0 + b_1 x_i + \varepsilon_i,$$

where y is the dependent variable, x the independent variable, i an index for the case (here, the country), $b_0$ and $b_1$ the model parameters, and $\varepsilon$ the error term.
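In R, a line of this form is estimated with lm(). A minimal sketch, reusing the simulated x and y from the covariance example:

```r
fit <- lm(y ~ x)   # regress y on x; an intercept (b0) is included by default
coef(fit)          # estimated intercept b0 and slope b1
summary(fit)       # adds standard errors, R-squared, and an F statistic
```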

[Figure: scatter plot with fitted line y = 0.81 − 0.92x; Cov(x, y) = −1.24, Cor(x, y) = −0.50, R² = 0.25]

[Figure: scatter plot with fitted line y = 1.24 + 1.77x; Cov(x, y) = 2.38, Cor(x, y) = 0.89, R² = 0.80]

[Figure: scatter plot with fitted line y = 0.80 − 1.73x; Cov(x, y) = −2.33, Cor(x, y) = −0.91, R² = 0.82]

[Figure: the line y = 1.24 + 1.77x annotated with its intercept (1.24) and its slope (a rise of 1.77 in y per one-unit increase in x)]

[Figure: the line y = 0.80 − 1.73x annotated with its intercept (0.80) and its slope (a drop of 1.73 in y per one-unit increase in x)]

Residuals

$$y_i = b_0 + b_1 x_i + \varepsilon_i$$

The linear prediction given the parameters would be

$$\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_i.$$

The extent to which the real value differs from the predicted value is

$$y_i - \hat{y}_i = y_i - \hat{b}_0 - \hat{b}_1 x_i = e_i.$$

By this formulation, the residuals (e) are the vertical distance between a point and the regression line (i.e. not the shortest distance between the point and the line).
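Predicted values and residuals are available directly from a fitted model. A small check of this decomposition, using fit from the earlier sketch:

```r
y_hat <- fitted(fit)              # predicted values for each case
e     <- resid(fit)               # residuals: observed minus predicted
all.equal(y, unname(y_hat + e))   # TRUE: each y_i is exactly y-hat_i + e_i
```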

Linear regression: residuals

[Figure: the scatter plot with fitted line y = 1.24 + 1.77x, illustrating the residuals]

To estimate the regression line, we need to estimate the parameters $b_0$ and $b_1$.

Ordinary Least Squares (OLS) is the most common method to do so. With OLS, we estimate the parameters such that the sum of squared residuals is minimized. (This is the same as minimizing the variance of the residuals.)
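For simple regression, this minimization has a closed-form solution: the slope is the covariance of x and y over the variance of x, and the intercept makes the line pass through the point of means. A quick verification against lm(), on the simulated example:

```r
b1 <- cov(x, y) / var(x)       # OLS slope estimate
b0 <- mean(y) - b1 * mean(x)   # OLS intercept: line goes through (x-bar, y-bar)
c(b0, b1)                      # matches coef(fit) from lm()
```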

INES example

[Figure: scatter plot of left–right self-placement (0–10) against age (20–80 years), with fitted line leftRight = 3.65 + 0.03·age]

Dependent variable: leftRight
------------------------------------------
age                    0.029***
                      (0.010)
Constant               3.649***
                      (0.523)
------------------------------------------
Observations           200
R²                     0.038
Residual Std. Error    2.141 (df = 198)
F Statistic            7.840*** (df = 1; 198)
------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01
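Tables in this layout can be produced from a fitted model with the stargazer package; that this is how the slide's table was generated is only an assumption, as is the data frame ines with columns leftRight and age:

```r
library(stargazer)                            # package assumed, not named on the slide
ines_fit <- lm(leftRight ~ age, data = ines)  # 'ines' is a hypothetical data frame
stargazer(ines_fit, type = "text")            # prints a text table like the one above
```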

Notation

$y_i$            Value on the dependent variable for case i
$x_i$            Value on the independent variable for case i
$\bar{x}$        Mean value on the independent variable
$\varepsilon_i$  The error for case i: $\varepsilon_i = y_i - \hat{y}_i$
$\beta_k$        True coefficient for variable k
$\hat{\beta}_k$  Estimated coefficient for variable k
$\hat{y}_i$      Predicted value on the dependent variable for case i

Multiple regression

For causal inference, we can add control variables to capture confounding factors.

When adding multiple variables to a regression, we are estimating a high-dimensional plane instead of a line.

Dependent variable: strucint
------------------------------
oilexp        −4.322**
             (1.901)
democracy      3.726***
             (1.188)
lngdp          2.306***
             (0.346)
lnpop          1.920***
             (0.266)
Constant     −11.670**
             (5.296)
------------------------------

$$\mathrm{strucint}_i = \beta_1 + \beta_2\,\mathrm{oilexp}_i + \beta_3\,\mathrm{democracy}_i + \beta_4\,\mathrm{lngdp}_i + \beta_5\,\mathrm{lnpop}_i + \varepsilon_i$$
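In R, control variables are simply added to the model formula. A sketch of the specification above, assuming a data frame with these columns (the name ross is hypothetical; the variables appear to come from the Ross 2004 study in the references):

```r
# One coefficient is estimated per regressor, plus an intercept.
# The data frame name 'ross' is an assumption for illustration.
mfit <- lm(strucint ~ oilexp + democracy + lngdp + lnpop, data = ross)
summary(mfit)   # coefficients, standard errors, and fit statistics
```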

Model fit

Once we have estimated a line, we might ask how well this line summarizes the relationship between those two variables.

A common measure is R²:

$$R^2 = 1 - \frac{\text{residual sum of squares}}{\text{total sum of squares}} = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2}.$$

This can be interpreted as the proportion of the variation in y explained by this model.
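Computed by hand on the simulated fit from earlier, and checked against lm()'s own value:

```r
rss <- sum(resid(fit)^2)      # residual sum of squares
tss <- sum((y - mean(y))^2)   # total sum of squares
1 - rss / tss                 # R-squared by the formula above
summary(fit)$r.squared        # lm() reports the same value
```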

Breakdown of variance

Total Sum of Squares (TSS): $\sum_{i=1}^{N}(y_i - \bar{y})^2$
Explained Sum of Squares (ESS): $\sum_{i=1}^{N}(\hat{y}_i - \bar{y})^2$
Residual Sum of Squares (RSS): $\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{N}\varepsilon_i^2$

TSS = ESS + RSS

[Figure: scatter plot of Y against X, illustrating the total variation of y around its mean]

[Figure: plot of the residuals against X for the same data, illustrating the residual variation around the fitted line]

Sometimes the explained sum of squares is called the "regression sum of squares" (RSS) and the residual sum of squares the "error sum of squares" (ESS), which might in fact be more accurate, since ε really represents errors, not residuals, in this specification. Beware the confusion!
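The decomposition can be verified numerically on the same fitted example:

```r
tss <- sum((y - mean(y))^2)            # total sum of squares
ess <- sum((fitted(fit) - mean(y))^2)  # explained sum of squares
rss <- sum(resid(fit)^2)               # residual sum of squares
all.equal(tss, ess + rss)              # TRUE: TSS = ESS + RSS exactly
```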

R²

How much of the variance did we explain?

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2} = \frac{\sum_{i=1}^{N}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2}$$

This can be interpreted as the proportion of the total variance explained by the model. For this interpretation, the model must include an intercept.

For simple linear regression (i.e. one independent variable), R² is the same as the correlation coefficient, Pearson's r, squared.
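A one-line check of this equivalence on the simulated example:

```r
cor(x, y)^2              # squared Pearson correlation
summary(fit)$r.squared   # identical for simple linear regression
```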

Recall the INES example above: with R² = 0.038, age explains only about 4% of the variation in left–right self-placement.

Regression assumptions

When interpreting a regression, we make a number of assumptions, among which the following (a diagnostic sketch follows the list):

• The variance of the errors (ε) is assumed to be constant, regardless of the value of y or x, which is called homoskedasticity. Violations of this assumption are called heteroskedasticity.
• The errors are independent of each other, i.e. not correlated with each other, which is called no autocorrelation.
• The true relationship is indeed linear.
• The errors have a normal distribution with mean zero (the variance is estimated).
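R's standard diagnostic plots give a quick visual check on several of these assumptions (constant error variance, linearity, approximate normality of the residuals). For the fitted example from earlier:

```r
par(mfrow = c(2, 2))   # arrange the four diagnostic panels in a 2-by-2 grid
plot(fit)              # residuals vs fitted, normal Q-Q, scale-location, leverage
```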

Conclusion

• Understand the concepts of covariance and correlation.
• Be able to read a regression table, both in a published article and in Stata output, and understand how a regression line is related to a scatter plot.
• Know what a regression coefficient is and what R² represents.
• Know the most important assumptions underlying linear regression.
• Understand the relationship between multiple regression analysis and control variables.

References

Ross, Michael. 2004. "Does taxation lead to representation?" British Journal of Political Science 34:229–249.

Verzani, John. 2005. Using R for Introductory Statistics. Boca Raton, FL: Chapman & Hall/CRC.
