Approaches to Analysing Politics
Correlation & Regression

Johan A. Elkink
School of Politics & International Relations
University College Dublin
10–12 April 2017

Outline

1 Covariance
2 Correlation
3 Simple linear regression
4 Multiple regression
5 Model fit
6 Assumptions

Causation

For something to be a causal relationship,
• the cause should precede the effect in time;
• the dependent and independent variables should correlate;
• no third variable should explain this correlation;
• there should be a logical explanation or causal mechanism linking the two.

Covariance

As a measure of the variation in a variable, we typically use the variance, which is the average squared distance to the mean.

Two variables can also covary or correlate. We can use covariance as a measure:

$$\mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{N} (y_i - \bar{y}) \cdot (x_i - \bar{x})}{N}$$

• Positive covariance: when x is high, y tends to be high, and vice versa.
• Negative covariance: when x is high, y tends to be low, and vice versa.
• Covariance near zero: x and y are uncorrelated.
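
The covariance formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the data are hypothetical toy values, and the function names `mean` and `covariance` are chosen here for illustration. It divides by N, as in the formula above; many statistics packages divide by N − 1 instead.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def covariance(x, y):
    """Average product of the deviations of x and y from their means (divisor N)."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / len(x)


# Hypothetical toy data, not taken from the lecture.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 5]    # tends to rise with x
y_down = [5, 4, 5, 4, 2]  # tends to fall with x

print(covariance(x, y_up))    # positive
print(covariance(x, y_down))  # negative
print(covariance(x, x))       # the covariance of x with itself is its variance
```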

[Figure: scatter plot titled "No correlation", Cov(x,y) = −1.24]
[Figure: scatter plot titled "Positive covariance", Cov(x,y) = 2.38]
[Figure: scatter plot titled "Negative covariance", Cov(x,y) = −2.33]

Correlation

Just like with variance, it is difficult to know what is a high or what is a low covariance, because the scale depends on the variation in the underlying x and y variables.

To correct for this, we can divide the covariance by the product of the standard deviations:

$$\mathrm{Cor}(x, y) = \frac{\mathrm{Cov}(x, y)}{s_x \cdot s_y}$$

This is called the Pearson correlation coefficient. While correlation is a general concept and many measures of correlation exist, most people who refer to correlation in statistics refer to this measure.

Pearson correlation

The coefficient $\mathrm{Cor}(x, y)$ has the following properties:
• Bounded between −1 and 1.
• A negative value implies negative correlation and negative covariance.
• A positive value implies positive correlation and positive covariance.
• A value close to zero implies no (linear) correlation.
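
The point about scale can be illustrated with a short sketch. It is not part of the original slides: the data are hypothetical, the helper names are chosen for illustration, and the standard deviations use the same divisor N as the covariance. Rescaling x (for example from metres to centimetres) multiplies the covariance by the same factor but leaves the correlation unchanged.

```python
import math


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def covariance(x, y):
    """Average product of deviations from the means (divisor N)."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / len(x)


def correlation(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    s_x = math.sqrt(covariance(x, x))
    s_y = math.sqrt(covariance(y, y))
    return covariance(x, y) / (s_x * s_y)


# Hypothetical toy data, not taken from the lecture.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
x_rescaled = [100 * xi for xi in x]  # same variable on a different scale

print(covariance(x, y), covariance(x_rescaled, y))    # covariance changes with the scale
print(correlation(x, y), correlation(x_rescaled, y))  # correlation does not
```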

[Figure: scatter plot titled "No correlation", Cov(x,y) = −1.24, Cor(x,y) = −0.50]
[Figure: scatter plot titled "Positive correlation", Cov(x,y) = 2.38, Cor(x,y) = 0.89]
[Figure: scatter plot titled "Negative correlation", Cov(x,y) = −2.33, Cor(x,y) = −0.91]

Linear regression

The aim of (simple) linear regression is to estimate a line that best summarizes the relationship between two variables.

The regression equation here is

$$y_i = b_0 + b_1 x_i + \varepsilon_i,$$

whereby y is the dependent variable, x the independent variable, i an index for the case (e.g. a country), $b_0$ and $b_1$ the model parameters, and ε the error term.

[Figure: scatter plot with fitted line y = 0.81 − 0.92x; Cov(x,y) = −1.24, Cor(x,y) = −0.50, R-squared = 0.25]
[Figure: scatter plot with fitted line y = 1.24 + 1.77x; Cov(x,y) = 2.38, Cor(x,y) = 0.89, R-squared = 0.80]
[Figure: scatter plot with fitted line y = 0.80 − 1.73x; Cov(x,y) = −2.33, Cor(x,y) = −0.91, R-squared = 0.82]
[Figure: the line y = 1.24 + 1.77x, annotated with its intercept (1.24) and its slope (y rises by 1.77 for each one-unit increase in x)]
[Figure: the line y = 0.80 − 1.73x, annotated with its intercept (0.80) and its slope (y falls by 1.73 for each one-unit increase in x)]

Residuals

$$y_i = b_0 + b_1 x_i + \varepsilon_i$$

The linear prediction given the parameters would be

$$\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_i.$$

The extent to which the real value differs from the predicted value is:

$$y_i - \hat{y}_i = y_i - \hat{b}_0 - \hat{b}_1 x_i = e_i.$$

By this formulation, the residuals (e) are the vertical distance between a point and the regression line (i.e. not the shortest distance between the point and the line).

[Figure: "Linear regression: residuals", scatter plot with fitted line y = 1.24 + 1.77x, R-squared = 0.80, showing the residuals as vertical distances to the line]

Ordinary Least Squares

To estimate the regression line, we need to estimate the parameters $b_0$ and $b_1$.

Ordinary Least Squares (OLS) is the most common method to do so. With OLS, we estimate the parameters such that the sum of squared residuals is minimized. (This is the same as minimizing the variance of the residuals.)
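
A minimal sketch of OLS for one independent variable, assuming hypothetical toy data and illustrative function names (none of this comes from the slides). It uses the standard closed-form solution for simple regression, in which the slope is the covariance of x and y divided by the variance of x and the intercept makes the line pass through the means. It then checks that moving either parameter away from the estimate increases the sum of squared residuals.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def ols_simple(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals of y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx          # slope: Cov(x, y) / Var(x)
    b0 = y_bar - b1 * x_bar   # intercept: the line passes through (x_bar, y_bar)
    return b0, b1


def sum_squared_residuals(x, y, b0, b1):
    """Sum of squared vertical distances between the points and the line."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical toy data, not taken from the lecture.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = ols_simple(x, y)
best = sum_squared_residuals(x, y, b0, b1)
print(b0, b1, best)

# Nudging either parameter away from the OLS estimate gives a worse fit.
for d0, d1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    print(sum_squared_residuals(x, y, b0 + d0, b1 + d1) > best)
```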

INES example

[Figure: scatter plot of left-right self-placement (0 to 10) against age (20 to 80), with fitted line leftRight = 3.65 + 0.03 · age]

Dependent variable: leftRight

age                    0.029***  (0.010)
Constant               3.649***  (0.523)

Observations           200
R2                     0.038
Residual Std. Error    2.141 (df = 198)
F Statistic            7.840*** (df = 1; 198)
Note: *p<0.1; **p<0.05; ***p<0.01

Notation

$y_i$          Value on the dependent variable for case i
$x_i$          Value on the independent variable for case i
$\bar{x}$      Mean value on the independent variable (across all cases)
$\varepsilon_i$   The error for case i: $\varepsilon_i = y_i - \hat{y}_i$
$\beta_k$      True coefficient for variable k
$\hat{\beta}_k$   Estimated coefficient for variable k
$\hat{y}_i$    Predicted value on the dependent variable for case i

Multiple regression

For causal inference, we can add control variables to capture confounding factors.

When adding multiple variables to a regression, we are estimating a high-dimensional plane instead of a line.

Dependent variable: strucint

oilexp        −4.322**   (1.901)
democracy      3.726***  (1.188)
lngdp          2.306***  (0.346)
lnpop          1.920***  (0.266)
Constant     −11.670**   (5.296)

$$\text{strucint}_i = \beta_1 + \beta_2\,\text{oilexp}_i + \beta_3\,\text{democracy}_i + \beta_4\,\text{lngdp}_i + \beta_5\,\text{lnpop}_i + \varepsilon_i$$
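
The role of a control variable can be sketched with a small, entirely hypothetical example (the data, variable names and helper functions below are illustrative and unrelated to the table above). Here y depends only on a confounder z, while x merely moves together with z. A regression of y on x alone attributes an effect to x; adding z as a control brings the coefficient on x to zero. The sketch solves the normal equations with plain Gaussian elimination, which is one simple way to compute OLS, not necessarily what statistical software does internally.

```python
def solve(a, b):
    """Solve the linear system a * coef = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - known) / m[r][r]
    return coef


def ols(rows, y):
    """OLS coefficients [intercept, b_1, ..., b_k] from the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # prepend a constant term
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical data: y is driven by z only; x is correlated with z.
z = [0, 1, 2, 3, 4, 5]
x = [1, 0, 3, 2, 5, 4]
y = [1 + 2 * zi for zi in z]

simple = ols([[xi] for xi in x], y)                        # y on x only
controlled = ols([[xi, zi] for xi, zi in zip(x, z)], y)    # y on x and z

print("without control:", simple)
print("with control:   ", controlled)
```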

Model fit

Once we have estimated a line, we might ask how well this line summarizes the relationship between those two variables.

A common measure is $R^2$:

$$R^2 = 1 - \frac{\text{residual sum of squares}}{\text{total sum of squares}} = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}.$$

This can be interpreted as the proportion of the variation in y explained by this model.

Breakdown of variance

Total Sum of Squares (TSS): $\sum_{i=1}^{N} (y_i - \bar{y})^2$
Explained Sum of Squares (ESS): $\sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2$
Residual Sum of Squares (RSS): $\sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{N} \varepsilon_i^2$

TSS = ESS + RSS

[Figure: scatter plot of y against x]
[Figure: residuals plotted against x]

Sometimes the second is called "regression sum of squares" (RSS) and the third "errors sum of squares" (ESS), which might in fact be more accurate, since ε really represents errors, not residuals, in this specification. Beware the confusion!

$R^2$

How much of the variance did we explain?

$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} = \frac{\sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$

Can be interpreted as the proportion of total variance explained by the model. For this interpretation, the model must include an intercept.

For simple linear regression (i.e. one independent variable), $R^2$ is the same as the correlation coefficient, Pearson's r, squared.
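
The decomposition and both forms of $R^2$ can be checked numerically. The following is a minimal sketch on hypothetical toy data, with illustrative function and variable names; it fits a simple regression with an intercept, computes TSS, ESS and RSS, and compares $R^2$ with the squared Pearson correlation.

```python
import math


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def ols_simple(x, y):
    """Intercept and slope of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


# Hypothetical toy data, not taken from the lecture.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = ols_simple(x, y)
y_hat = [b0 + b1 * xi for xi in x]
y_bar = mean(y)

tss = sum((yi - y_bar) ** 2 for yi in y)                # total variation in y
ess = sum((fi - y_bar) ** 2 for fi in y_hat)            # variation captured by the line
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # variation left in the residuals

r_squared = 1 - rss / tss

# Pearson's r, computed directly from the data.
x_bar = mean(x)
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / math.sqrt(
    sum((xi - x_bar) ** 2 for xi in x) * tss
)

print(tss, ess + rss)        # the two agree up to rounding
print(r_squared, ess / tss)  # the two forms of R-squared agree
print(r_squared, r ** 2)     # R-squared equals r squared in simple regression
```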

INES example

Dependent variable: leftRight

age                    0.029***  (0.010)
Constant               3.649***  (0.523)

Observations           200
R2                     0.038
Residual Std. Error    2.141 (df = 198)
F Statistic            7.840*** (df = 1; 198)
Note: *p<0.1; **p<0.05; ***p<0.01

Regression assumptions

When interpreting a regression analysis, we make a number of assumptions, among which:
• The variance of the errors (ε) is assumed to be constant, regardless of the value of y or x, which is called homoskedasticity. Violations of this assumption are called heteroskedasticity.
• The errors are independent of each other, i.e. not correlated with each other, which is called no autocorrelation.
• The true relationship is indeed linear.
• The errors have a normal distribution with mean zero (the variance is estimated).
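
As a rough illustration of the first assumption, the sketch below builds hypothetical data whose errors deliberately grow with x, fits a simple regression, and compares the average squared residual for low and high values of x. This is only an informal eyeball check under made-up data and illustrative names, not a formal test for heteroskedasticity.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def ols_simple(x, y):
    """Intercept and slope of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


# Hypothetical data: the errors are constructed to get larger as x increases.
x = [1, 2, 3, 4, 5, 6, 7, 8]
errors = [0.1, -0.1, 0.2, -0.2, 1.0, -1.0, 2.0, -2.0]
y = [2 + xi + ei for xi, ei in zip(x, errors)]

b0, b1 = ols_simple(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# x is sorted, so the first half holds the low-x cases and the second half the high-x cases.
half = len(x) // 2
spread_low = mean([e ** 2 for e in residuals[:half]])
spread_high = mean([e ** 2 for e in residuals[half:]])

print(spread_low, spread_high)  # a much larger second value hints at heteroskedasticity
```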

Conclusion

• Understand the concepts of covariance and correlation.
• Be able to read a regression table (in a published article and in Stata output) and understand how a regression line is related to a scatter plot.
• Know what a regression coefficient is and what $R^2$ represents.
• Know the most important assumptions underlying linear regression.
• Understand the relationship between multiple regression analysis and control variables.

References

Ross, Michael. 2004. "Does taxation lead to representation?" British Journal of Political Science 34:229–249.
Verzani, John. 2005. Using R for introductory statistics. Boca Raton, FL: Chapman & Hall/CRC.