Approaches to Analysing Politics Correlation & Regression
Johan A. Elkink
School of Politics & International Relations
University College Dublin
10–12 April 2017

Outline

1 Covariance
2 Correlation
3 Simple linear regression
4 Multiple regression
5 Model fit
6 Assumptions

Causation

For something to be a causal relationship,
• the cause should precede the effect in time;
• the dependent and independent variables should correlate;
• no third variable should explain this correlation;
• there should be a logical explanation or causal mechanism linking the two.

Covariance

As a measure of the variation in a variable, we typically use the variance, which is the average squared distance to the mean.

Two variables can also covary or correlate. We can use covariance as a measure:

\[ \mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{N} (y_i - \bar{y}) \cdot (x_i - \bar{x})}{N} \]

• Positive covariance: when x is high, y tends to be high, and vice versa.
• Negative covariance: when x is high, y tends to be low, and vice versa.
• Covariance near zero: x and y are uncorrelated.
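The covariance formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy is available; the function name `covariance` and the toy data are hypothetical and not taken from the slides. It divides by N, as in the formula, rather than by N − 1.

```python
import numpy as np

def covariance(x, y):
    """Average product of deviations from the means (divides by N)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean((y - y.mean()) * (x - x.mean())))

# Hypothetical toy data, not taken from the slides.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = np.array([2.0, 4.0, 5.0, 4.0, 5.0])  # tends to rise with x
y_down = y_up[::-1]                         # tends to fall with x

cov_up = covariance(x, y_up)
cov_down = covariance(x, y_down)
print(cov_up, cov_down)
```

For these toy data the covariance is 1.2 for the rising series and −1.2 for the falling one, matching the sign interpretation in the bullets. The result agrees with `np.cov(x, y, bias=True)[0, 1]`, where `bias=True` selects the N divisor.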
[Figure: three scatterplots of y against x. "No correlation": Cov(x,y) = −1.24. "Positive covariance": Cov(x,y) = 2.38. "Negative covariance": Cov(x,y) = −2.33.]

Correlation

Just as with variance, it is difficult to know what counts as a high or a low covariance, because the scale depends on the variation in the underlying x and y variables.

To correct for this, we can divide the covariance by the standard deviations:

\[ \mathrm{Cor}(x, y) = \frac{\mathrm{Cov}(x, y)}{s_x \cdot s_y} \]

This is called the Pearson correlation coefficient. While correlation is a general concept and many measures of correlation exist, most people who refer to correlation in statistics mean this measure.

Pearson correlation

• Bounded between −1 and 1.
• A negative value implies negative correlation and negative covariance.
• A positive value implies positive correlation and positive covariance.
• A value close to zero implies no correlation.
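The scale argument above can be checked numerically. The sketch below is a minimal illustration, assuming NumPy; the helper names and the toy data are hypothetical. It shows that rescaling x changes the covariance but leaves the Pearson correlation unchanged.

```python
import numpy as np

def covariance(x, y):
    """Average product of deviations from the means (divides by N)."""
    return float(np.mean((y - y.mean()) * (x - x.mean())))

def pearson(x, y):
    """Covariance divided by the product of the standard deviations."""
    # np.std uses the N divisor by default (ddof=0), consistent with covariance().
    return covariance(x, y) / (x.std() * y.std())

# Hypothetical toy data, not taken from the slides.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

cov_original = covariance(x, y)
cov_rescaled = covariance(100 * x, y)  # e.g. x measured in different units
r_original = pearson(x, y)
r_rescaled = pearson(100 * x, y)

print(cov_original, cov_rescaled)  # the covariance scales with the units of x
print(r_original, r_rescaled)      # the correlation does not
```

The value returned by `pearson` should match `np.corrcoef(x, y)[0, 1]`, since the choice of divisor cancels out in the ratio.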
[Figure: the same three scatterplots, now with correlations. "No correlation": Cov(x,y) = −1.24, Cor(x,y) = −0.50. "Positive correlation": Cov(x,y) = 2.38, Cor(x,y) = 0.89. "Negative correlation": Cov(x,y) = −2.33, Cor(x,y) = −0.91.]

Linear regression

The aim of (simple) linear regression is to estimate a line that best summarizes the relationship between two variables.

The regression equation here is

\[ y_i = b_0 + b_1 x_i + \varepsilon_i, \]

where y is the dependent variable, x the independent variable, i an index of the case (e.g. a country), b_0 and b_1 the model parameters, and \varepsilon the error term.
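To make the roles of b_0, b_1 and the error term concrete, the following minimal Python sketch generates data from the regression equation. The parameter values, the sample size, the range of x, and the normally distributed errors are all illustrative assumptions; the equation itself does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed parameter values, chosen only for illustration.
b0, b1 = 1.24, 1.77
n = 20

x = rng.uniform(-2.0, 2.0, size=n)    # independent variable
eps = rng.normal(0.0, 1.0, size=n)    # error term (assumed normal)
y = b0 + b1 * x + eps                 # dependent variable

# Each observation deviates from the line b0 + b1 * x by exactly its error.
print(np.allclose(y - (b0 + b1 * x), eps))
```

The error term is what makes the points scatter around the line rather than lie exactly on it.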
[Figure: the three scatterplots with fitted regression lines. First panel: y = 0.81 − 0.92x, Cov(x,y) = −1.24, Cor(x,y) = −0.50, R-squared = 0.25. Second panel: y = 1.24 + 1.77x, Cov(x,y) = 2.38, Cor(x,y) = 0.89, R-squared = 0.80. Third panel: y = 0.80 − 1.73x, Cov(x,y) = −2.33, Cor(x,y) = −0.91, R-squared = 0.82.]

[Figure: the fitted line y = 1.24 + 1.77x with the intercept (1.24) and the slope marked: a one-unit increase in x corresponds to an increase of 1.77 in predicted y.]

[Figure: the fitted line y = 0.80 − 1.73x with the intercept (0.80) and the slope marked: a one-unit increase in x corresponds to a decrease of 1.73 in predicted y.]

Residuals

\[ y_i = b_0 + b_1 x_i + \varepsilon_i \]

The linear prediction given the estimated parameters would be

\[ \hat{y}_i = \hat{b}_0 + \hat{b}_1 x_i. \]

The extent to which the real value differs from the predicted value is:

\[ e_i = y_i - \hat{y}_i = y_i - \hat{b}_0 - \hat{b}_1 x_i. \]

By this formulation, the residuals (e) are the vertical distances between each point and the regression line (i.e. not the shortest distance between the point and the line).

[Figure: "Linear regression: residuals". Scatterplot with the fitted line y = 1.24 + 1.77x (R-squared = 0.80), illustrating the residuals.]

Ordinary Least Squares

To estimate the regression line, we need to estimate the parameters b_0 and b_1.
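As a hedged illustration of residuals and of what estimating b_0 and b_1 involves, the sketch below computes the vertical residuals for two candidate lines and compares their sums of squared residuals. It assumes NumPy; the function names and the toy data are hypothetical. The closed-form expressions in `least_squares_line` are the standard textbook least-squares solution for a single predictor and are not stated on the slides.

```python
import numpy as np

def residuals(x, y, b0, b1):
    """Vertical distance between each observed y and the line b0 + b1 * x."""
    return y - (b0 + b1 * x)

def sum_squared_residuals(x, y, b0, b1):
    return float(np.sum(residuals(x, y, b0, b1) ** 2))

def least_squares_line(x, y):
    """Closed-form least-squares estimates for one predictor (textbook result)."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return float(b0), float(b1)

# Hypothetical toy data, not taken from the slides.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

b0_hat, b1_hat = least_squares_line(x, y)
ssr_fitted = sum_squared_residuals(x, y, b0_hat, b1_hat)
ssr_other = sum_squared_residuals(x, y, 1.0, 1.0)  # an arbitrary alternative line

print(b0_hat, b1_hat)
print(ssr_fitted, ssr_other)
```

For these toy data the fitted line is approximately y = 2.2 + 0.6x, with a sum of squared residuals of about 2.4, compared with 4.0 for the arbitrary line y = 1 + x.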
Ordinary Least Squares (OLS) is the most common method for estimating these parameters. With OLS, we estimate the parameters such that the sum of squared residuals is minimized.

(This is the same as minimizing the variance of the residuals.)

INES example

[Figure: scatterplot of left-right self-placement (0 to 10) against age (20 to 80), with the fitted line leftRight = 3.65 + 0.03 · age.]

Dependent variable: leftRight

                      Coefficient   (Std. error)
age                   0.029***      (0.010)
Constant              3.649***      (0.523)

Observations          200
R^2                   0.038
Residual Std. Error   2.141 (df = 198)
F Statistic           7.840*** (df = 1; 198)

Note: *p<0.1; **p<0.05; ***p<0.01

Notation

y_i              Value on the dependent variable for case i
x_i              Value on the independent variable for case i
\bar{x}          Mean value on the independent variable
\varepsilon_i    The error for case i: \varepsilon_i = y_i - \hat{y}_i
\beta_k          True coefficient for variable k
\hat{\beta}_k    Estimated coefficient for variable k
\hat{y}_i        Predicted value on the dependent variable for case i

Multiple regression

For causal inference, we can add control variables to capture confounding factors.

When adding multiple variables to a regression, we are estimating a plane in higher dimensions (a hyperplane) instead of a line.
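The effect of adding a control variable can be sketched with simulated data. This is a minimal illustration under stated assumptions: the data-generating process, the coefficient values, and the variable names `x1` and `x2` are hypothetical, and the fit uses NumPy's general least-squares solver rather than any particular statistics package.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Assumed data-generating process: x2 is a confounder correlated with x1.
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)

# Multiple regression: intercept, x1 and x2 (fitting a plane).
X_full = np.column_stack([np.ones(n), x1, x2])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# Simple regression that omits the confounder (fitting a line).
X_short = np.column_stack([np.ones(n), x1])
beta_short, *_ = np.linalg.lstsq(X_short, y, rcond=None)

print("with control:   ", beta_full)
print("without control:", beta_short)
```

With the control included, the estimated coefficient on `x1` should land near the assumed value of 2.0. Without it, the coefficient absorbs part of the effect of `x2` and is pulled towards 2.0 − 1.5 · 0.5 = 1.25.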
Multiple regression

Dependent variable: strucint

             Coefficient   (Std. error)
oilexp       −4.322**      (1.901)
democracy    3.726***      (1.188)
lngdp        2.306***      (0.346)
lnpop        1.920***      (0.266)
Constant     −11.670**     (5.296)

\[ \mathit{strucint}_i = \beta_1 + \beta_2 \mathit{oilexp}_i + \beta_3 \mathit{democracy}_i + \beta_4 \mathit{lngdp}_i + \beta_5 \mathit{lnpop}_i + \varepsilon_i \]

Model fit

Once we have estimated a line, we might ask how well this line summarizes the relationship between the two variables.

A common measure is R^2:

\[ R^2 = 1 - \frac{\text{residual sum of squares}}{\text{total sum of squares}} = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}. \]

This can be interpreted as the proportion of the variation in y explained by the model.

Breakdown of variance

Total Sum of Squares (TSS): \sum_{i=1}^{N} (y_i - \bar{y})^2
Explained Sum of Squares (ESS): \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2
Residual Sum of Squares (RSS): \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{N} \varepsilon_i^2

TSS = ESS