Correlation and Regression Overview
Introduction to Statistical Methods for Measuring “Omics” and Field Data

Correlation and Regression

Overview
• Correlation
• Simple Linear Regression

Correlation

General Overview of Correlational Analysis
• The purpose is to measure the strength of a linear relationship between two variables.
• A correlation coefficient does not establish causation (i.e., that a change in X causes a change in Y).
• X is typically the input, measured, or independent variable.
• Y is typically the output, predicted, or dependent variable.
• If X increases and there is a predictable shift in the values of Y, a correlation exists.

General Properties of Correlation Coefficients
• Values range between -1 and +1.
• The value of the correlation coefficient reflects how closely the points on a scatterplot follow a straight line.
• You should be able to look at a scatterplot and estimate what the correlation would be.
• You should be able to look at a correlation coefficient and visualize the scatterplot.

Interpretation
• Depends on the purpose of the study, but here is a “general guideline”...
• Value = magnitude of the relationship
• Sign = direction of the relationship

[Figure: Correlation graph. Scatterplots of Y against X showing strong and weak relationships, with positive and negative correlation.]

The Pearson Correlation Coefficient

Correlation Coefficient
• The correlation coefficient is a measure of the strength and the direction of a linear relationship between two variables. The symbol r represents the sample correlation coefficient. The formula for r is

$$ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}} $$

• The range of the correlation coefficient is -1 to 1. If x and y have a strong positive linear correlation, r is close to 1. If x and y have a strong negative linear correlation, r is close to -1. If there is no linear correlation or only a weak linear correlation, r is close to 0.

Calculating a Correlation Coefficient

In Words                                                                  In Symbols
1. Find the sum of the x-values.                                          $\sum x$
2. Find the sum of the y-values.                                          $\sum y$
3. Multiply each x-value by its corresponding y-value and find the sum.   $\sum xy$
4. Square each x-value and find the sum.                                  $\sum x^2$
5. Square each y-value and find the sum.                                  $\sum y^2$
6. Use these five sums to calculate the correlation coefficient r with the formula above.

Correlation Coefficient
Example: Calculate the correlation coefficient r for the following data.

x     y     xy    x²    y²
1    -3    -3     1     9
2    -1    -2     4     1
3     0     0     9     0
4     1     4    16     1
5     2    10    25     4

$\sum x = 15$, $\sum y = -1$, $\sum xy = 9$, $\sum x^2 = 55$, $\sum y^2 = 15$

$$ r = \frac{5(9) - (15)(-1)}{\sqrt{5(55) - 15^2}\,\sqrt{5(15) - (-1)^2}} = \frac{60}{\sqrt{50}\,\sqrt{74}} \approx 0.986 $$

There is a strong positive linear correlation between x and y.

Significance Test for Correlation
• Hypotheses
  H0: ρ = 0 (no correlation)
  HA: ρ ≠ 0 (correlation exists)
• Test statistic (with n - 2 degrees of freedom)

$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} $$

R/RStudio
The function cor.test calculates the correlation and its significance test. For the example data above:

    X <- c(1, 2, 3, 4, 5)
    Y <- c(-3, -1, 0, 1, 2)
    cor.test(X, Y)

Linear Regression

Linear regression
• Deals with the relationship between two variables, X and Y.
• Y is the variable whose “behavior” we wish to study (e.g., the fuel efficiency of a car).
• X is the variable we believe would help explain the behavior of Y (e.g., the size of the car).
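Before moving on to the regression model, the significance test can be carried through by hand for the five-point correlation example earlier in this section (r ≈ 0.986, n = 5). This is a sketch under two assumptions that the slide text does not spell out: the usual t statistic for a Pearson correlation, $t = r\sqrt{n-2}/\sqrt{1-r^2}$, and a two-sided 5% critical value of about 3.182 for 3 degrees of freedom, taken from a standard t table.

```latex
\begin{aligned}
r^2 &= \frac{60^2}{50 \cdot 74} = \frac{3600}{3700} \approx 0.973,
\qquad 1 - r^2 = \frac{100}{3700} = \frac{1}{37} \\[4pt]
t &= \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}
   = \frac{60}{\sqrt{3700}} \cdot \sqrt{3} \cdot \sqrt{37}
   = 6\sqrt{3} \approx 10.39,
\qquad \text{df} = n - 2 = 3
\end{aligned}
```

Since 10.39 is well above 3.182, H0: ρ = 0 would be rejected at the 5% level. The t value and degrees of freedom reported by cor.test(X, Y) for these data should agree with this hand calculation.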
Regression model
• The simple linear regression model:

$$ y = \beta_0 + \beta_1 x + \varepsilon $$

Regression hypothesis
H0: β1 = 0 (no linear relationship)
HA: β1 ≠ 0
The t statistic for the hypothesis test is t = b1 / (standard error of b1), where b1 is the estimated slope (written m in the regression line equation below).

Components of the model
• β0 is the y-intercept.
• β1 is the slope.
• ε is the random error.

Regression Line
• A regression line, also called a line of best fit, is the line for which the sum of the squares of the residuals is a minimum.

The Equation of a Regression Line
The equation of a regression line for an independent variable x and a dependent variable y is

ŷ = mx + b

where ŷ is the predicted y-value for a given x-value. The slope m and y-intercept b are given by

$$ m = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} \quad\text{and}\quad b = \bar{y} - m\bar{x} = \frac{\sum y}{n} - m\,\frac{\sum x}{n} $$

where $\bar{y}$ is the mean of the y-values and $\bar{x}$ is the mean of the x-values. The regression line always passes through $(\bar{x}, \bar{y})$.

Regression Line
Example: Find the equation of the regression line.

x     y     xy    x²    y²
1    -3    -3     1     9
2    -1    -2     4     1
3     0     0     9     0
4     1     4    16     1
5     2    10    25     4

$\sum x = 15$, $\sum y = -1$, $\sum xy = 9$, $\sum x^2 = 55$, $\sum y^2 = 15$

$$ m = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{5(9) - (15)(-1)}{5(55) - (15)^2} = \frac{60}{50} = 1.2 $$

$$ b = \bar{y} - m\bar{x} = \frac{-1}{5} - 1.2\left(\frac{15}{5}\right) = -0.2 - 3.6 = -3.8 $$

The equation of the regression line is ŷ = 1.2x - 3.8.

Regression Line
Example: The following data represent the number of hours 12 different students watched television during the weekend and the score each student received on a test the following Monday.
a.) Find the equation of the regression line.
b.) Use the equation to find the expected test score for a student who watches 9 hours of TV.
Hours, x          0     1     2     3     3     5     5     5     6     7     7    10
Test score, y    96    85    82    74    95    68    76    84    58    65    75    50
xy                0    85   164   222   285   340   380   420   348   455   525   500
x²                0     1     4     9     9    25    25    25    36    49    49   100
y²             9216  7225  6724  5476  9025  4624  5776  7056  3364  4225  5625  2500

$\sum x = 54$, $\sum y = 908$, $\sum xy = 3724$, $\sum x^2 = 332$, $\sum y^2 = 70836$

$$ m = \frac{12(3724) - (54)(908)}{12(332) - (54)^2} = \frac{-4344}{1068} \approx -4.07 $$

$$ b = \bar{y} - m\bar{x} \approx \frac{908}{12} - (-4.067)\left(\frac{54}{12}\right) \approx 93.97 $$

Regression Line
Example continued: Using the equation ŷ = -4.07x + 93.97, we can predict the test score for a student who watches 9 hours of TV.

ŷ = -4.07x + 93.97 = -4.07(9) + 93.97 = 57.34

A student who watches 9 hours of TV over the weekend can expect to receive about a 57.34 on Monday’s test.

Variation About a Regression Line
The total variation about a regression line is the sum of the squares of the differences between the y-value of each ordered pair and the mean of y.

$$ \text{Total variation} = \sum (y_i - \bar{y})^2 $$

The explained variation is the sum of the squares of the differences between each predicted y-value and the mean of y.

$$ \text{Explained variation} = \sum (\hat{y}_i - \bar{y})^2 $$

The unexplained variation is the sum of the squares of the differences between the y-value of each ordered pair and the corresponding predicted y-value.

$$ \text{Unexplained variation} = \sum (y_i - \hat{y}_i)^2 $$

Total variation = Explained variation + Unexplained variation

Coefficient of Determination
• The coefficient of determination R² is the ratio of the explained variation to the total variation. That is,

$$ R^2 = \frac{\text{Explained variation}}{\text{Total variation}} $$

Example:
• The correlation coefficient for the data on the number of hours students watched television and their test scores is r ≈ -0.831. Find the coefficient of determination.

$$ R^2 \approx (-0.831)^2 \approx 0.691 $$

About 69.1% of the variation in the test scores can be explained by the variation in the hours of TV watched. About 30.9% of the variation is unexplained.

RStudio
• Function cor.test is used to calculate the correlation r and its t statistic.
• Function lm is used to fit the regression.
• Example:

Hours, x          0     1     2     3     3     5     5     5     6     7     7    10
Test score, y    96    85    82    74    95    68    76    84    58    65    75    50

    X <- c(0, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 10)              # hours of TV
    Y <- c(96, 85, 82, 74, 95, 68, 76, 84, 58, 65, 75, 50)   # test scores
    cor.test(X, Y)
    G <- lm(Y ~ X)   # response (test score) on the left, predictor (hours) on the right
    summary(G)       # R is case-sensitive: summary, not Summary

RStudio

    > Count <- c(9, 25, 15, 2, 14, 25, 24, 47)
    > Count
    [1]  9 25 15  2 14 25 24 47
    > Speed <- c(2, 3, 5, 9, 14, 24, 29, 34)
    > G <- lm(Count ~ Speed)
    > summary(G)

    Call:
    lm(formula = Count ~ Speed)

    Residuals:
        Min      1Q  Median      3Q     Max
    -13.377  -5.801  -1.542   5.051  14.371

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   8.2546     5.8531   1.410   0.2081
    Speed         0.7914     0.3081   2.569   0.0424 *
    ---
    Signif.