Pearson's Correlation Coefficient
Statistical Reasoning and Quantitative Methods
Maxime Gravoueille ([email protected])
Week 8

Week 8: Bivariate Statistics for Quantitative Variables
• Overview of bivariate statistics with two quantitative variables.
• Descriptive statistics.
• Inferential statistics.
• Application using Stata.
• Correlation vs. causation.
• Assumptions.

Overview
Objectives:
• Describe the direction of the relationship, or association, between two variables.
• Test whether the relationship is statistically significant.
• Measure the strength of the relationship.
This week is about the association between two quantitative variables. Examples:
• Years of education and income.
• GDP and level of pollution.
• Income tax rates and income growth.

Overview
Remember that we have two parts to evaluate the association between two variables:
1. Descriptive statistics: two tools to describe the relationship between quantitative variables:
• Visual assessment of the direction with a scattergram.
• Investigation of the strength (and direction) with Pearson's correlation coefficient.
2. Inferential statistics: test whether the relationship between the two variables is significant:
• Hypothesis testing with Pearson's correlation coefficient.

Descriptive Statistics
A first way to investigate a correlation between two quantitative variables is to plot a scattergram.
• For each observation, we plot a data point at the intersection of the value of the first variable (on the x-axis) and the value of the second variable (on the y-axis).
• In Stata: scatter depvar indvar
• Example: the relationship between GDP per capita and the level of education. Here each dot represents a country.

Descriptive Statistics
A second way to measure a linear relationship between two quantitative variables is to use a correlation coefficient.
• Correlation is a bivariate analysis that measures the strength of the association between two variables and the direction of the relationship.
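The course computes correlation coefficients in Stata, but the definition can be illustrated in any language. A minimal Python sketch (with made-up education/income numbers, standing in for the Stata workflow) computes Pearson's r directly from its sample definition, i.e. the average product of standardized deviations, and checks it against NumPy's built-in:

```python
# Illustrative Python sketch (the course itself uses Stata): compute
# Pearson's r from its sample definition and compare with numpy.
import numpy as np

def pearson_r(x, y):
    """r = (1/(n-1)) * sum over i of ((X_i - mean)/s_X) * ((Y_i - mean)/s_Y)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    zx = (x - x.mean()) / x.std(ddof=1)   # standardized deviations of X
    zy = (y - y.mean()) / y.std(ddof=1)   # standardized deviations of Y
    return (zx * zy).sum() / (n - 1)

# Hypothetical data: years of education vs. income (invented numbers).
educ   = [8, 10, 12, 12, 14, 16, 16, 18]
income = [22, 25, 30, 28, 35, 42, 40, 50]

r = pearson_r(educ, income)
print(round(r, 3))                                      # strong positive r
print(np.isclose(r, np.corrcoef(educ, income)[0, 1]))   # matches numpy
```

The positive sign says education and income move together; the value close to 1 says the linear association is strong.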
There are several types of correlation coefficients, but the most commonly used is Pearson's r.
• Population notation: ρ = Cov(X, Y) / √(Var(X) · Var(Y)), with −1 ≤ ρ ≤ 1.
• Sample notation: r = (1 / (n − 1)) Σᵢ₌₁ⁿ ((Xᵢ − X̄) / s_X) · ((Yᵢ − Ȳ) / s_Y).
It only detects linear correlation:
• Uncorrelated ≠ unrelated: r captures only linear relationships.
• Correlated ≠ unconfounded: you cannot tell the difference between direct and indirect correlation.

Descriptive Statistics
[Scatterplot examples: perfect positive or negative correlation.]

Descriptive Statistics
[Scatterplot examples: significant moderate or strong correlation.]

Descriptive Statistics
[Scatterplot examples: insignificant weak or non-linear correlation.]

Descriptive Statistics
Interpreting Pearson's r, as a general rule:
• High: r greater than 0.7.
• Moderate: r between 0.3 and 0.7.
• Low: r lower than 0.3.
• Always comment on the sign (the direction of the relationship) and on the value of r (the magnitude of the relationship).

Descriptive Statistics
Summary of descriptive statistics:
• Assess the direction of the correlation:
• Positive correlation: the values of x and y move in the same direction: the greater x, the greater y.
• Negative correlation: the values of x and y move in opposite directions: the greater x, the smaller y.
• Assess the magnitude of the correlation:
• → the strength of the association.
• Coefficients range from −1 to +1, with 0 indicating no correlation (x and y are not related; they vary independently).
• The closer the coefficient is to −1 or +1, the stronger the association.

Inferential Statistics
We want to test whether the correlation between our two quantitative variables is significant. As usual, two steps:
1. Hypothesis testing using a t-test for Pearson's r:
• The null hypothesis H₀ is r = 0 (no correlation).
• The test statistic is: T = r · √((n − 2) / (1 − r²)).
2. We choose a significance level: we set the probability of getting it wrong, i.e. of concluding there is a relationship when there is none. Then either:
• Compare the t-statistic directly to the table of t distributions, or
• Look at the p-value.
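The two inferential steps above can be sketched in a few lines of Python (a stand-in for the Stata output; the r and n values here are invented for the example). The test statistic T = r·√((n − 2)/(1 − r²)) follows a t distribution with n − 2 degrees of freedom under H₀:

```python
# Illustrative t-test for Pearson's r (the slides use Stata's pwcorr).
# The sample correlation and sample size below are made-up numbers.
import math
from scipy import stats

r, n = 0.45, 50                            # hypothetical r and sample size
T = r * math.sqrt((n - 2) / (1 - r**2))    # test statistic
p = 2 * stats.t.sf(abs(T), df=n - 2)       # two-sided p-value, df = n - 2

print(round(T, 2))                         # about 3.49
# At the 5% significance level, reject H0 (no correlation) if p < 0.05.
print(p < 0.05)                            # True: the correlation is significant
```

Equivalently, one can compare T to the critical value from a t table with n − 2 degrees of freedom, as described in step 2 above.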
Inferential Statistics
Statistical significance using Stata: Stata provides it directly if you add an option to the pwcorr var1 var2 command.
• The option , obs shows the number of observations.
• The option , sig shows the coefficient's p-value.
• The option , star(.05) shows significance stars, here at the 5% level for example.
pwcorr can be used with multiple variables to create a correlation matrix.
Sanity check:
• Uncorrelated ≠ independent: for instance, X and Y = X².
• Correlated ≠ causally related: that is the most important point!

Application using Stata
Back to the Body Mass Index! We want to investigate whether there is a relationship between age and the Body Mass Index.
Step 1: Descriptive statistics.

Application using Stata
Back to the Body Mass Index! We want to investigate whether there is a relationship between age and the Body Mass Index.
Step 2: Inferential statistics.
The p-value is extremely close to 0 => the correlation between age and the Body Mass Index is statistically significant, even at the 1% level.

Correlation vs. Causation
Why is correlation not always causation?
• Reverse causality: causality can go in both directions.
• X → Y and Y → X.
• Omitted variable: a third variable that influences both variables is not accounted for.
• Z → X and Z → Y.
Correlations can exist without a cause-and-effect relationship between the variables.
• Correlation is a necessary but not sufficient condition for causality.
• Theoretical explanations are always necessary to understand the correlations observed.

Assumptions
Assumptions of Pearson's r:
• Linear relationship between the variables.
• Random sampling.
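The sanity check "uncorrelated ≠ independent" can be verified numerically. A small Python sketch (standing in for Stata): take X symmetric around 0 and Y = X², so Y is completely determined by X, yet Pearson's r between them is exactly zero because the relationship is nonlinear:

```python
# Uncorrelated does not mean independent: with X symmetric around 0,
# Y = X**2 depends perfectly on X, yet their Pearson r is exactly 0.
import numpy as np

x = np.array([-3., -2., -1., 0., 1., 2., 3.])
y = x ** 2                     # perfect, but nonlinear, dependence

r = np.corrcoef(x, y)[0, 1]
print(round(r, 10))            # prints 0.0: no *linear* association at all
```

This is why the scattergram matters: plotting the data would reveal the parabola that the correlation coefficient completely misses.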