Introduction to Statistical Methods for Measuring “Omics” and Field Data

Correlation and Regression Overview

• Correlation

Simple Correlation

General Overview of Correlational Analysis

• The purpose is to measure the strength of a linear relationship between 2 variables.

• A correlation coefficient does not establish causation (i.e., it does not show that a change in X causes a change in Y).

• X is typically the input, measured, or independent variable.

• Y is typically the output, predicted, or dependent variable.

• If X increases and there is a predictable shift in the values of Y, a correlation exists.

General Properties of Correlation Coefficients

• Values can range between +1 and -1

• The magnitude of the correlation coefficient reflects how tightly the points on a scatterplot cluster around a straight line.

• You should be able to look at a scatterplot and estimate what the correlation would be

• You should be able to look at a correlation coefficient and visualize the scatterplot.

Interpretation

• Interpretation depends on the purpose of the study, but here is a general guideline:

• Value = magnitude of the relationship
• Sign = direction of the relationship

Correlation graph

[Figure: scatterplots of strong and weak relationships between X and Y, illustrating positive correlation (upward trend) and negative correlation (downward trend).]

The Pearson Correlation Coefficient

• The correlation coefficient is a measure of the strength and the direction of a linear relationship between two variables. The symbol r represents the sample correlation coefficient. The formula for r is

r = [n∑xy − (∑x)(∑y)] / √([n∑x² − (∑x)²][n∑y² − (∑y)²])

• The range of the correlation coefficient is −1 to 1. If x and y have a strong positive linear correlation, r is close to 1. If x and y have a strong negative linear correlation, r is close to −1. If there is no linear correlation or a weak linear correlation, r is close to 0.

Calculating a Correlation Coefficient

In words (and in symbols):
1. Find the sum of the x-values. (∑x)
2. Find the sum of the y-values. (∑y)
3. Multiply each x-value by its corresponding y-value and find the sum. (∑xy)
4. Square each x-value and find the sum. (∑x²)
5. Square each y-value and find the sum. (∑y²)
6. Use these five sums to calculate the correlation coefficient:
   r = [n∑xy − (∑x)(∑y)] / √([n∑x² − (∑x)²][n∑y² − (∑y)²])
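As a minimal sketch, the six steps above can be carried out directly in R for the five-point data set used in the worked example below, and checked against the built-in cor function:

x <- c(1, 2, 3, 4, 5)
y <- c(-3, -1, 0, 1, 2)
n <- length(x)

sum_x  <- sum(x)       # step 1
sum_y  <- sum(y)       # step 2
sum_xy <- sum(x * y)   # step 3
sum_x2 <- sum(x^2)     # step 4
sum_y2 <- sum(y^2)     # step 5

r <- (n * sum_xy - sum_x * sum_y) /
  sqrt((n * sum_x2 - sum_x^2) * (n * sum_y2 - sum_y^2))   # step 6
r          # about 0.986
cor(x, y)  # same value from the built-in function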


Example: Calculate the correlation coefficient r for the following data.

x:  1   2   3   4   5
y: −3  −1   0   1   2


Computing xy, x², and y² for each pair and summing each column:

 x:   1    2    3    4    5    ∑x = 15
 y:  −3   −1    0    1    2    ∑y = −1
xy:  −3   −2    0    4   10    ∑xy = 9
x²:   1    4    9   16   25    ∑x² = 55
y²:   9    1    0    1    4    ∑y² = 15

r = [5(9) − (15)(−1)] / √([5(55) − 15²][5(15) − (−1)²]) = 60 / √(50 · 74) ≈ 0.986

There is a strong positive linear correlation between x and y.

Significance Test for Correlation

• Hypotheses

H0: ρ = 0 (no correlation)

HA: ρ ≠ 0 (correlation exists)

• Test statistic

t = r√(n − 2) / √(1 − r²) (with n − 2 degrees of freedom)
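A minimal sketch of this test in R, using the five-point example data from above; the value agrees with the t statistic that cor.test reports:

x <- c(1, 2, 3, 4, 5)
y <- c(-3, -1, 0, 1, 2)
n <- length(x)
r <- cor(x, y)

t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat                              # test statistic
2 * pt(-abs(t_stat), df = n - 2)    # two-sided p-value
cor.test(x, y)$statistic            # same t value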

R/RStudio

The function cor.test is used to calculate the correlation for the example data:

x:  1   2   3   4   5
y: −3  −1   0   1   2

X <- c(1, 2, 3, 4, 5)
Y <- c(-3, -1, 0, 1, 2)
cor.test(X, Y)

Linear Regression

• Deals with the relationship between two variables, X and Y.

• Y is the variable whose “behavior” we wish to study (e.g., fuel efficiency of a car).

• X is the variable we believe would help explain the behavior of Y (e.g., the size of the car).

Regression model

• The simple linear regression model: Y = β0 + β1X + ε, where β0 is the intercept, β1 is the slope, and ε is a random error term.
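As a rough illustration of this model (the intercept, slope, and error spread below are made-up values, not from this section), data can be simulated from Y = β0 + β1X + ε and fitted with lm:

set.seed(1)
b0 <- 2                                       # assumed intercept
b1 <- 0.5                                     # assumed slope
X <- 1:50
Y <- b0 + b1 * X + rnorm(length(X), sd = 2)   # random error term

fit <- lm(Y ~ X)
coef(fit)   # estimates should be close to b0 and b1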

Regression hypothesis

The test statistic for the hypothesis H0: β1 = 0 is t = b1 / SE(b1), the estimated slope divided by its standard error, with n − 2 degrees of freedom.
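A minimal sketch of where these quantities appear in R, using the five-point example data: summary(lm(...))$coefficients holds the estimate, its standard error, and the resulting t value.

x <- c(1, 2, 3, 4, 5)
y <- c(-3, -1, 0, 1, 2)
fit <- lm(y ~ x)

coefs <- summary(fit)$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs["x", "Estimate"] / coefs["x", "Std. Error"]   # equals coefs["x", "t value"]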

Regression Line

• A regression line, also called a line of best fit, is the line for which the sum of the squares of the residuals is a minimum.

The Equation of a Regression Line

The equation of a regression line for an independent variable x and a dependent variable y is

ŷ = mx + b

where ŷ is the predicted y-value for a given x-value. The slope m and y-intercept b are given by

m = [n∑xy − (∑x)(∑y)] / [n∑x² − (∑x)²]  and  b = ȳ − m·x̄ = (∑y)/n − m·(∑x)/n

where ȳ is the mean of the y-values and x̄ is the mean of the x-values. The regression line always passes through (x̄, ȳ).
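A minimal sketch computing m and b from these sum formulas for the five-point data set used in the example that follows, checked against lm:

x <- c(1, 2, 3, 4, 5)
y <- c(-3, -1, 0, 1, 2)
n <- length(x)

m <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
b <- mean(y) - m * mean(x)
c(m = m, b = b)    # m = 1.2, b = -3.8
coef(lm(y ~ x))    # same intercept and slope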

Example: Find the equation of the regression line for the following data.

 x:   1    2    3    4    5    ∑x = 15
 y:  −3   −1    0    1    2    ∑y = −1
xy:  −3   −2    0    4   10    ∑xy = 9
x²:   1    4    9   16   25    ∑x² = 55
y²:   9    1    0    1    4    ∑y² = 15

m = [5(9) − (15)(−1)] / [5(55) − 15²] = 60/50 = 1.2

b = ȳ − m·x̄ = (−1/5) − 1.2(15/5) = −0.2 − 3.6 = −3.8

So the equation of the regression line is ŷ = 1.2x − 3.8.

Example: The following data represent the number of hours 12 different students watched television during the weekend and the scores of each student on a test the following Monday.

a) Find the equation of the regression line.
b) Use the equation to find the expected test score for a student who watches 9 hours of TV.

Hours, x:         0     1     2     3     3     5     5     5     6     7     7    10
Test score, y:   96    85    82    74    95    68    76    84    58    65    75    50
xy:               0    85   164   222   285   340   380   420   348   455   525   500
x²:               0     1     4     9     9    25    25    25    36    49    49   100
y²:            9216  7225  6724  5476  9025  4624  5776  7056  3364  4225  5625  2500

∑x = 54, ∑y = 908, ∑xy = 3724, ∑x² = 332, ∑y² = 70836

m = [12(3724) − (54)(908)] / [12(332) − 54²] = (44688 − 49032) / (3984 − 2916) = −4344/1068 ≈ −4.07

b = ȳ − m·x̄ = 908/12 − (−4.07)(54/12) ≈ 75.67 + 18.30 ≈ 93.97

Example continued: Using the equation ŷ = −4.07x + 93.97, we can predict the test score for a student who watches 9 hours of TV.

ŷ = –4.07x + 93.97

= –4.07(9) + 93.97

= 57.34

A student who watches 9 hours of TV over the weekend can expect to receive about a 57.34 on Monday’s test.

Variation About a Regression Line

The total variation about a regression line is the sum of the squares of the differences between the y-value of each ordered pair and the mean of y:

Total variation = ∑(yᵢ − ȳ)²

The explained variation is the sum of the squares of the differences between each predicted y-value and the mean of y:

Explained variation = ∑(ŷᵢ − ȳ)²

The unexplained variation is the sum of the squares of the differences between the y-value of each ordered pair and each corresponding predicted y-value:

Unexplained variation = ∑(yᵢ − ŷᵢ)²

Total variation = Explained variation + Unexplained variation
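A minimal sketch checking this identity in R with the TV-hours / test-score data from this section:

x <- c(0, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 10)
y <- c(96, 85, 82, 74, 95, 68, 76, 84, 58, 65, 75, 50)
fit <- lm(y ~ x)
y_hat <- fitted(fit)

total       <- sum((y - mean(y))^2)       # total variation
explained   <- sum((y_hat - mean(y))^2)   # explained variation
unexplained <- sum((y - y_hat)^2)         # unexplained variation

all.equal(total, explained + unexplained)   # TRUE: the identity above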

Coefficient of Determination

• The coefficient of determination R² is the ratio of the explained variation to the total variation. That is,

R² = Explained variation / Total variation

Example: The correlation coefficient for the data that represent the number of hours students watched television and the test scores of each student is r ≈ −0.831. Find the coefficient of determination.

R² ≈ (−0.831)² ≈ 0.691

About 69.1% of the variation in the test scores can be explained by the variation in the hours of TV watched. About 30.9% of the variation is unexplained.
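As a quick check in R (a sketch using the same TV-hours / test-score data), the squared correlation matches the R-squared that lm reports:

x <- c(0, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 10)
y <- c(96, 85, 82, 74, 95, 68, 76, 84, 58, 65, 75, 50)
cor(x, y)^2                     # about 0.691
summary(lm(y ~ x))$r.squared    # same value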

RStudio

• The function cor.test is used to calculate the correlation r and its t statistic.
• The function lm is used to fit a regression model.

• Example:

Hours, x:       0   1   2   3   3   5   5   5   6   7   7  10
Test score, y: 96  85  82  74  95  68  76  84  58  65  75  50

X <- c(0, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 10)
Y <- c(96, 85, 82, 74, 95, 68, 76, 84, 58, 65, 75, 50)

cor.test(X, Y)

G <- lm(Y ~ X)   # regress test score (Y) on hours of TV (X)
summary(G)
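Continuing from the fit G above (a minimal sketch), coef and predict reproduce the fitted equation and the 9-hour prediction from the worked example:

coef(G)                                   # intercept about 93.97, slope about -4.07
predict(G, newdata = data.frame(X = 9))   # about 57.4; the rounded equation gives 57.34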

RStudio

> Count <- c(9,25,15,2,14,25,24,47)
> Count
[1]  9 25 15  2 14 25 24 47
> Speed <- c(2,3,5,9,14,24,29,34)
> G <- lm(Count ~ Speed)
> summary(G)

Call:
lm(formula = Count ~ Speed)

Residuals:
    Min      1Q  Median      3Q     Max
-13.377  -5.801  -1.542   5.051  14.371

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   8.2546     5.8531   1.410   0.2081
Speed         0.7914     0.3081   2.569   0.0424 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.16 on 6 degrees of freedom
Multiple R-squared: 0.5238, Adjusted R-squared: 0.4444
F-statistic: 6.599 on 1 and 6 DF, p-value: 0.0424
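As a brief follow-up sketch, the fitted line from this output can be drawn over the data with base R plotting:

Count <- c(9, 25, 15, 2, 14, 25, 24, 47)
Speed <- c(2, 3, 5, 9, 14, 24, 29, 34)
G <- lm(Count ~ Speed)

plot(Speed, Count, pch = 16)   # scatterplot of the data
abline(G)                      # adds the fitted line (intercept 8.25, slope 0.79)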