Introduction to Statistical Methods for Measuring “Omics” and Field Data

Correlation and Regression Overview

• Correlation

Simple Correlation

General Overview of Correlational Analysis

• The purpose is to measure the strength of a linear relationship between 2 variables.

• A correlation coefficient does not establish causation (i.e., it does not show that a change in X causes a change in Y).

• X is typically the input, measured, or independent variable.

• Y is typically the output, predicted, or dependent variable.

• If X increases and there is a predictable shift in the values of Y, a correlation exists.

General Properties of Correlation Coefficients

• Values can range between +1 and -1

• The magnitude of the correlation coefficient reflects how tightly the points on a scatterplot cluster around a straight line.

• You should be able to look at a scatterplot and estimate what the correlation would be

• You should be able to look at a correlation coefficient and visualize the scatterplot.

Interpretation

• Interpretation depends on the purpose of the study, but here is a general guideline:

• Value = magnitude of the relationship
• Sign = direction of the relationship

Correlation graph

[Figure: scatterplots of strong and weak relationships between X and Y, illustrating positive correlation (upward trend) and negative correlation (downward trend).]

The Pearson Correlation Coefficient

• The correlation coefficient is a measure of the strength and the direction of a linear relationship between two variables. The symbol r represents the sample correlation coefficient. The formula for r is

r = [n∑xy − (∑x)(∑y)] / √([n∑x² − (∑x)²][n∑y² − (∑y)²])

• The range of the correlation coefficient is −1 to 1. If x and y have a strong positive linear correlation, r is close to 1. If x and y have a strong negative linear correlation, r is close to −1. If there is no linear correlation or a weak linear correlation, r is close to 0.

Calculating a Correlation Coefficient

In words (and in symbols):
1. Find the sum of the x-values. (∑x)
2. Find the sum of the y-values. (∑y)
3. Multiply each x-value by its corresponding y-value and find the sum. (∑xy)
4. Square each x-value and find the sum. (∑x²)
5. Square each y-value and find the sum. (∑y²)
6. Use these five sums to calculate the correlation coefficient:
   r = [n∑xy − (∑x)(∑y)] / √([n∑x² − (∑x)²][n∑y² − (∑y)²])
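As a minimal sketch, the six steps above can be carried out directly in R for the five-point data set used in the worked example below, and checked against the built-in cor function:

x <- c(1, 2, 3, 4, 5)
y <- c(-3, -1, 0, 1, 2)
n <- length(x)

sum_x  <- sum(x)       # step 1
sum_y  <- sum(y)       # step 2
sum_xy <- sum(x * y)   # step 3
sum_x2 <- sum(x^2)     # step 4
sum_y2 <- sum(y^2)     # step 5

r <- (n * sum_xy - sum_x * sum_y) /
  sqrt((n * sum_x2 - sum_x^2) * (n * sum_y2 - sum_y^2))   # step 6
r          # about 0.986
cor(x, y)  # same value from the built-in function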


Example: Calculate the correlation coefficient r for the following data.

x:  1   2   3   4   5
y: −3  −1   0   1   2


Computing xy, x², and y² for each pair and summing each column:

 x:   1    2    3    4    5    ∑x = 15
 y:  −3   −1    0    1    2    ∑y = −1
xy:  −3   −2    0    4   10    ∑xy = 9
x²:   1    4    9   16   25    ∑x² = 55
y²:   9    1    0    1    4    ∑y² = 15

r = [5(9) − (15)(−1)] / √([5(55) − 15²][5(15) − (−1)²]) = 60 / √(50 · 74) ≈ 0.986

There is a strong positive linear correlation between x and y.

Significance Test for Correlation

• Hypotheses

H0: ρ = 0 (no correlation)

HA: ρ ≠ 0 (correlation exists)

• Test statistic

t = r√(n − 2) / √(1 − r²) (with n − 2 degrees of freedom)
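A minimal sketch of this test in R, using the five-point example data from above; the value agrees with the t statistic that cor.test reports:

x <- c(1, 2, 3, 4, 5)
y <- c(-3, -1, 0, 1, 2)
n <- length(x)
r <- cor(x, y)

t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat                              # test statistic
2 * pt(-abs(t_stat), df = n - 2)    # two-sided p-value
cor.test(x, y)$statistic            # same t value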

R/RStudio

The function cor.test is used to calculate the correlation for the example data:

x:  1   2   3   4   5
y: −3  −1   0   1   2

X <- c(1, 2, 3, 4, 5)
Y <- c(-3, -1, 0, 1, 2)
cor.test(X, Y)

Linear Regression

• Deals with the relationship between two variables, X and Y.

• Y is the variable whose “behavior” we wish to study (e.g., fuel efficiency of a car).

• X is the variable we believe would help explain the behavior of Y (e.g., the size of the car).

Regression model

• The simple linear regression model: Y = β0 + β1X + ε, where β0 is the intercept, β1 is the slope, and ε is a random error term.
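As a rough illustration of this model (the intercept, slope, and error spread below are made-up values, not from this section), data can be simulated from Y = β0 + β1X + ε and fitted with lm:

set.seed(1)
b0 <- 2                                       # assumed intercept
b1 <- 0.5                                     # assumed slope
X <- 1:50
Y <- b0 + b1 * X + rnorm(length(X), sd = 2)   # random error term

fit <- lm(Y ~ X)
coef(fit)   # estimates should be close to b0 and b1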

Regression hypothesis

The test statistic for the hypothesis H0: β1 = 0 is t = b1 / SE(b1), the estimated slope divided by its standard error, with n − 2 degrees of freedom.
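A minimal sketch of where these quantities appear in R, using the five-point example data: summary(lm(...))$coefficients holds the estimate, its standard error, and the resulting t value.

x <- c(1, 2, 3, 4, 5)
y <- c(-3, -1, 0, 1, 2)
fit <- lm(y ~ x)

coefs <- summary(fit)$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs["x", "Estimate"] / coefs["x", "Std. Error"]   # equals coefs["x", "t value"]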

Regression Line

• A regression line, also called a line of best fit, is the line for which the sum of the squares of the residuals is a minimum.

The Equation of a Regression Line

The equation of a regression line for an independent variable x and a dependent variable y is

ŷ = mx + b

where ŷ is the predicted y-value for a given x-value. The slope m and y-intercept b are given by

m = [n∑xy − (∑x)(∑y)] / [n∑x² − (∑x)²]  and  b = ȳ − m·x̄ = (∑y)/n − m·(∑x)/n

where ȳ is the mean of the y-values and x̄ is the mean of the x-values. The regression line always passes through (x̄, ȳ).
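A minimal sketch computing m and b from these sum formulas for the five-point data set used in the example that follows, checked against lm:

x <- c(1, 2, 3, 4, 5)
y <- c(-3, -1, 0, 1, 2)
n <- length(x)

m <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
b <- mean(y) - m * mean(x)
c(m = m, b = b)    # m = 1.2, b = -3.8
coef(lm(y ~ x))    # same intercept and slope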

Example: Find the equation of the regression line for the following data.

 x:   1    2    3    4    5    ∑x = 15
 y:  −3   −1    0    1    2    ∑y = −1
xy:  −3   −2    0    4   10    ∑xy = 9
x²:   1    4    9   16   25    ∑x² = 55
y²:   9    1    0    1    4    ∑y² = 15

m = [5(9) − (15)(−1)] / [5(55) − 15²] = 60/50 = 1.2

b = ȳ − m·x̄ = (−1/5) − 1.2(15/5) = −0.2 − 3.6 = −3.8

So the equation of the regression line is ŷ = 1.2x − 3.8.

Example: The following data represent the number of hours 12 different students watched television during the weekend and the scores of each student on a test the following Monday.

a) Find the equation of the regression line.
b) Use the equation to find the expected test score for a student who watches 9 hours of TV.

Hours, x:         0     1     2     3     3     5     5     5     6     7     7    10
Test score, y:   96    85    82    74    95    68    76    84    58    65    75    50
xy:               0    85   164   222   285   340   380   420   348   455   525   500
x²:               0     1     4     9     9    25    25    25    36    49    49   100
y²:            9216  7225  6724  5476  9025  4624  5776  7056  3364  4225  5625  2500

∑x = 54, ∑y = 908, ∑xy = 3724, ∑x² = 332, ∑y² = 70836

m = [12(3724) − (54)(908)] / [12(332) − 54²] = (44688 − 49032) / (3984 − 2916) = −4344/1068 ≈ −4.07

b = ȳ − m·x̄ = 908/12 − (−4.07)(54/12) ≈ 75.67 + 18.30 ≈ 93.97

Example continued: Using the equation ŷ = −4.07x + 93.97, we can predict the test score for a student who watches 9 hours of TV.

ŷ = –4.07x + 93.97

= –4.07(9) + 93.97

= 57.34

A student who watches 9 hours of TV over the weekend can expect to receive about a 57.34 on Monday’s test.

Variation About a Regression Line

The total variation about a regression line is the sum of the squares of the differences between the y-value of each ordered pair and the mean of y:

Total variation = ∑(yᵢ − ȳ)²

The explained variation is the sum of the squares of the differences between each predicted y-value and the mean of y:

Explained variation = ∑(ŷᵢ − ȳ)²

The unexplained variation is the sum of the squares of the differences between the y-value of each ordered pair and each corresponding predicted y-value:

Unexplained variation = ∑(yᵢ − ŷᵢ)²

Total variation = Explained variation + Unexplained variation
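A minimal sketch checking this identity in R with the TV-hours / test-score data from this section:

x <- c(0, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 10)
y <- c(96, 85, 82, 74, 95, 68, 76, 84, 58, 65, 75, 50)
fit <- lm(y ~ x)
y_hat <- fitted(fit)

total       <- sum((y - mean(y))^2)       # total variation
explained   <- sum((y_hat - mean(y))^2)   # explained variation
unexplained <- sum((y - y_hat)^2)         # unexplained variation

all.equal(total, explained + unexplained)   # TRUE: the identity above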

Coefficient of Determination

• The coefficient of determination R² is the ratio of the explained variation to the total variation. That is,

R² = Explained variation / Total variation

Example: The correlation coefficient for the data that represent the number of hours students watched television and the test scores of each student is r ≈ −0.831. Find the coefficient of determination.

R² ≈ (−0.831)² ≈ 0.691

About 69.1% of the variation in the test scores can be explained by the variation in the hours of TV watched. About 30.9% of the variation is unexplained.
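As a quick check in R (a sketch using the same TV-hours / test-score data), the squared correlation matches the R-squared that lm reports:

x <- c(0, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 10)
y <- c(96, 85, 82, 74, 95, 68, 76, 84, 58, 65, 75, 50)
cor(x, y)^2                     # about 0.691
summary(lm(y ~ x))$r.squared    # same value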

RStudio

• The function cor.test is used to calculate the correlation r and its t statistic.
• The function lm is used to fit a regression model.

• Example:

Hours, x:       0   1   2   3   3   5   5   5   6   7   7  10
Test score, y: 96  85  82  74  95  68  76  84  58  65  75  50

X <- c(0, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 10)
Y <- c(96, 85, 82, 74, 95, 68, 76, 84, 58, 65, 75, 50)

cor.test(X, Y)

G <- lm(Y ~ X)   # regress test score (Y) on hours of TV (X)
summary(G)
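Continuing from the fit G above (a minimal sketch), coef and predict reproduce the fitted equation and the 9-hour prediction from the worked example:

coef(G)                                   # intercept about 93.97, slope about -4.07
predict(G, newdata = data.frame(X = 9))   # about 57.4; the rounded equation gives 57.34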

RStudio

> Count <- c(9,25,15,2,14,25,24,47)
> Count
[1]  9 25 15  2 14 25 24 47
> Speed <- c(2,3,5,9,14,24,29,34)
> G <- lm(Count ~ Speed)
> summary(G)

Call:
lm(formula = Count ~ Speed)

Residuals:
    Min      1Q  Median      3Q     Max
-13.377  -5.801  -1.542   5.051  14.371

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   8.2546     5.8531   1.410   0.2081
Speed         0.7914     0.3081   2.569   0.0424 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.16 on 6 degrees of freedom
Multiple R-squared: 0.5238, Adjusted R-squared: 0.4444
F-statistic: 6.599 on 1 and 6 DF, p-value: 0.0424
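As a brief follow-up sketch, the fitted line from this output can be drawn over the data with base R plotting:

Count <- c(9, 25, 15, 2, 14, 25, 24, 47)
Speed <- c(2, 3, 5, 9, 14, 24, 29, 34)
G <- lm(Count ~ Speed)

plot(Speed, Count, pch = 16)   # scatterplot of the data
abline(G)                      # adds the fitted line (intercept 8.25, slope 0.79)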