Introduction to Statistical Methods for Measuring “Omics” and Field Data
Correlation and Regression
Overview
• Correlation
• Simple Linear Regression
Correlation
General Overview of Correlational Analysis
• The purpose is to measure the strength of a linear relationship between two variables.
• A correlation coefficient does not establish causation (i.e., that a change in X causes a change in Y).
• X is typically the input, measured, or independent variable.
• Y is typically the output, predicted, or dependent variable.
• If X increases and there is a predictable shift in the values of Y, a correlation exists.
General Properties of Correlation Coefficients
• Values can range between +1 and -1
• The magnitude of the correlation coefficient reflects how tightly the points on a scatterplot cluster around a straight line
• You should be able to look at a scatterplot and estimate what the correlation would be
• You should be able to look at a correlation coefficient and visualize the scatterplot
Interpretation
• Depends on what the purpose of the study is… but here is a “general guideline”...
• Value = magnitude of the relationship
• Sign = direction of the relationship
Correlation graph
[Figure: four scatterplots of Y against X. Columns: strong relationships, weak relationships. Rows: positive correlation, negative correlation.]
The Pearson Correlation Coefficient
Correlation Coefficient
• The correlation coefficient is a measure of the strength and the direction of a linear relationship between two variables. The symbol r represents the sample correlation coefficient. The formula for r is
\[ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}. \]
• The range of the correlation coefficient is -1 to 1. If x and y have a strong positive linear correlation, r is close to 1. If x and y have a strong negative linear correlation, r is close to -1. If there is no linear correlation or a weak linear correlation, r is close to 0.
Calculating a Correlation Coefficient
In Words                                                                  In Symbols
1. Find the sum of the x-values.                                          ∑x
2. Find the sum of the y-values.                                          ∑y
3. Multiply each x-value by its corresponding y-value and find the sum.   ∑xy
4. Square each x-value and find the sum.                                  ∑x²
5. Square each y-value and find the sum.                                  ∑y²
6. Use these five sums to calculate the correlation coefficient.
\[ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} \]
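The six steps can be sketched in R as follows. This is a minimal illustration, not a replacement for R's built-in routines: pearson_from_sums() is a hypothetical helper name, and the five data points are the small data set used in this section's worked example.

```r
# Pearson correlation computed from the five sums (steps 1 to 6).
# pearson_from_sums() is a hypothetical helper, not a base R function.
pearson_from_sums <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) > 1)
  n      <- length(x)
  sum_x  <- sum(x)       # step 1: sum of the x-values
  sum_y  <- sum(y)       # step 2: sum of the y-values
  sum_xy <- sum(x * y)   # step 3: sum of the products
  sum_x2 <- sum(x^2)     # step 4: sum of the squared x-values
  sum_y2 <- sum(y^2)     # step 5: sum of the squared y-values
  # step 6: combine the five sums
  (n * sum_xy - sum_x * sum_y) /
    (sqrt(n * sum_x2 - sum_x^2) * sqrt(n * sum_y2 - sum_y^2))
}

x <- c(1, 2, 3, 4, 5)
y <- c(-3, -1, 0, 1, 2)

r_manual  <- pearson_from_sums(x, y)
r_builtin <- cor(x, y)   # base R's Pearson correlation, for comparison
print(c(manual = r_manual, builtin = r_builtin))
```

Both values should agree, which is a quick way to check a hand calculation.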
Correlation Coefficient
Example: Calculate the correlation coefficient r for the following data.
  x     y    xy    x²    y²
  1    −3    −3     1     9
  2    −1    −2     4     1
  3     0     0     9     0
  4     1     4    16     1
  5     2    10    25     4
∑x = 15   ∑y = −1   ∑xy = 9   ∑x² = 55   ∑y² = 15
\[ r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}} = \frac{5(9) - (15)(-1)}{\sqrt{5(55) - 15^2}\,\sqrt{5(15) - (-1)^2}} = \frac{60}{\sqrt{50}\,\sqrt{74}} \approx 0.986 \]
There is a strong positive linear correlation between x and y.
Significance Test for Correlation
• Hypothesis
H0: ρ = 0 (no correlation)
HA: ρ ≠ 0 (correlation exists)
• Test statistic
\[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \]
(with n − 2 degrees of freedom)
R/RStudio
  x     y
  1    −3
  2    −1
  3     0
  4     1
  5     2
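For these five data points, the test statistic can also be computed by hand. The sketch below assumes the standard t statistic for H0: ρ = 0, namely t = r√(n − 2)/√(1 − r²) with n − 2 degrees of freedom; the variable names are illustrative.

```r
# Hand computation of the t statistic for H0: rho = 0
# (assumes the standard formula t = r * sqrt(n - 2) / sqrt(1 - r^2)).
x <- c(1, 2, 3, 4, 5)
y <- c(-3, -1, 0, 1, 2)

n <- length(x)
r <- cor(x, y)                                 # sample correlation coefficient
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)     # test statistic
p_value <- 2 * pt(-abs(t_stat), df = n - 2)    # two-sided p-value from the t distribution
print(c(r = r, t = t_stat, df = n - 2, p = p_value))
```

A small p-value is evidence against H0, that is, evidence that a linear correlation exists.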
The function cor.test calculates the correlation coefficient and tests its significance:
X <- c(1, 2, 3, 4, 5)
Y <- c(-3, -1, 0, 1, 2)
cor.test(X, Y)
Linear Regression
Linear regression
• Linear regression deals with the relationship between two variables, X and Y.
• Y is the variable whose “behavior” we wish to study (e.g., the fuel efficiency of a car).
• X is the variable we believe would help explain the behavior of Y (e.g., the size of the car).
Regression model
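• The simple linear regression model:
\[ Y = \beta_0 + \beta_1 X + \varepsilon \]
Regression hypothesis
H0: β1 = 0 (no linear relationship)
HA: β1 ≠ 0 (a linear relationship exists)
The t statistic for the hypothesis test is t = b1 / SE(b1), where b1 is the estimated slope and SE(b1) is its standard error.
Components of the model
• β0 is the intercept, β1 is the slope, and ε is the random error.
Regression Line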
• A regression line, also called a line of best fit, is the line for which the sum of the squares of the residuals is a minimum.
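The least-squares idea can be illustrated numerically. The sketch below uses a small illustrative data set and compares the sum of squared residuals of the line fitted by lm with two nudged lines; the offsets 0.1 and 0.5 are arbitrary choices, and checking two alternatives is an illustration, not a proof.

```r
# Illustration: the fitted line has a smaller sum of squared residuals
# than nearby lines (offsets below are arbitrary, for illustration only).
x <- c(1, 2, 3, 4, 5)
y <- c(-3, -1, 0, 1, 2)

# Sum of squared residuals for a candidate line y = intercept + slope * x
sse <- function(intercept, slope) sum((y - (intercept + slope * x))^2)

fit <- lm(y ~ x)
b0  <- unname(coef(fit)[1])   # fitted intercept
b1  <- unname(coef(fit)[2])   # fitted slope

sse_fit     <- sse(b0, b1)          # least-squares line
sse_steeper <- sse(b0, b1 + 0.1)    # same intercept, slightly steeper
sse_higher  <- sse(b0 + 0.5, b1)    # same slope, shifted upward
print(c(fitted = sse_fit, steeper = sse_steeper, higher = sse_higher))
```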
The Equation of a Regression Line
The equation of a regression line for an independent variable x and a dependent variable y is ŷ = mx + b, where ŷ is the predicted y-value for a given x-value. The slope m and y-intercept b are given by
\[ m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} \quad\text{and}\quad b = \bar{y} - m\bar{x} = \frac{\sum y}{n} - m\,\frac{\sum x}{n}, \]
where ȳ is the mean of the y-values and x̄ is the mean of the x-values. The regression line always passes through (x̄, ȳ).
Regression Line
Example: Find the equation of the regression line.
  x     y    xy    x²    y²
  1    −3    −3     1     9
  2    −1    −2     4     1
  3     0     0     9     0
  4     1     4    16     1
  5     2    10    25     4
∑x = 15   ∑y = −1   ∑xy = 9   ∑x² = 55   ∑y² = 15
\[ m = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^2 - \left(\sum x\right)^2} = \frac{5(9) - (15)(-1)}{5(55) - (15)^2} = \frac{60}{50} = 1.2 \]
\[ b = \bar{y} - m\bar{x} = \frac{-1}{5} - (1.2)\,\frac{15}{5} = -3.8 \]
The equation of the regression line is ŷ = 1.2x − 3.8.
Regression Line
Example: The following data represent the number of hours 12 different students watched television during the weekend and the scores of each student who took a test the following Monday.
a.) Find the equation of the regression line.
b.) Use the equation to find the expected test score for a student who watches 9 hours of TV.
Hours, x          0     1     2     3     3     5     5     5     6     7     7    10
Test score, y    96    85    82    74    95    68    76    84    58    65    75    50
xy                0    85   164   222   285   340   380   420   348   455   525   500
x²                0     1     4     9     9    25    25    25    36    49    49   100
y²             9216  7225  6724  5476  9025  4624  5776  7056  3364  4225  5625  2500
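The derived rows of the table, their sums, and the resulting slope and intercept can be reproduced with a short R sketch. It applies the slope and intercept formulas for a regression line directly; the variable names hours and score are illustrative.

```r
# TV-hours example: column sums, then slope m and intercept b
# from the regression-line formulas (variable names are illustrative).
hours <- c(0, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 10)
score <- c(96, 85, 82, 74, 95, 68, 76, 84, 58, 65, 75, 50)
n <- length(hours)

sums <- c(x  = sum(hours),
          y  = sum(score),
          xy = sum(hours * score),
          x2 = sum(hours^2),
          y2 = sum(score^2))

# m = (n*Sxy - Sx*Sy) / (n*Sx2 - Sx^2),  b = mean(y) - m * mean(x)
m <- (n * sums[["xy"]] - sums[["x"]] * sums[["y"]]) /
     (n * sums[["x2"]] - sums[["x"]]^2)
b <- mean(score) - m * mean(hours)

print(sums)
print(round(c(slope = m, intercept = b), 2))
```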
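∑x = 54   ∑y = 908   ∑xy = 3724   ∑x² = 332   ∑y² = 70836
Substituting these sums gives m = (12(3724) − (54)(908)) / (12(332) − 54²) = −4344/1068 ≈ −4.07 and b = ȳ − m x̄ ≈ 75.67 − (−4.0674)(4.5) ≈ 93.97.
Regression Line
Example continued: Using the equation ŷ = −4.07x + 93.97, we can predict the test score for a student who watches 9 hours of TV.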
ŷ = –4.07x + 93.97
= –4.07(9) + 93.97
= 57.34
A student who watches 9 hours of TV over the weekend can expect to score about 57.34 on Monday’s test.
Variation About a Regression Line
The total variation about a regression line is the sum of the squares of the differences between the y-value of each ordered pair and the mean of y.
\[ \text{Total variation} = \sum \left(y_i - \bar{y}\right)^2 \]
The explained variation is the sum of the squares of the differences between each predicted y-value and the mean of y.
\[ \text{Explained variation} = \sum \left(\hat{y}_i - \bar{y}\right)^2 \]
The unexplained variation is the sum of the squares of the differences between the y-value of each ordered pair and each corresponding predicted y-value.
\[ \text{Unexplained variation} = \sum \left(y_i - \hat{y}_i\right)^2 \]
Total variation = Explained variation + Unexplained variation
Coefficient of Determination
• The coefficient of determination R² is the ratio of the explained variation to the total variation. That is,
\[ R^2 = \frac{\text{Explained variation}}{\text{Total variation}} \]
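The three variations and their ratio can be computed directly. The sketch below uses the TV-hours data from the regression example and a line fitted with lm; the variable names are illustrative.

```r
# Decomposing the variation for the TV-hours data (names are illustrative).
hours <- c(0, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 10)
score <- c(96, 85, 82, 74, 95, 68, 76, 84, 58, 65, 75, 50)

fit   <- lm(score ~ hours)
y_hat <- fitted(fit)     # predicted y-values
y_bar <- mean(score)     # mean of y

total       <- sum((score - y_bar)^2)   # total variation
explained   <- sum((y_hat - y_bar)^2)   # explained variation
unexplained <- sum((score - y_hat)^2)   # unexplained variation
r_squared   <- explained / total        # coefficient of determination

print(round(c(total = total, explained = explained,
              unexplained = unexplained, R2 = r_squared), 3))
```

The explained and unexplained parts should add up to the total, and the ratio should match the R-squared value reported by summary(fit).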
Example:
• The correlation coefficient for the data on the number of hours students watched television and the test score of each student is r ≈ −0.831. Find the coefficient of determination.
In simple linear regression, R² equals the square of r:
R² ≈ (−0.831)² ≈ 0.691
About 69.1% of the variation in the test scores can be explained by the variation in the hours of TV watched. About 30.9% of the variation is unexplained.
RStudio
• The function cor.test calculates the correlation coefficient r and its t statistic.
• The function lm fits a regression model.
• Example:
Hours, x          0     1     2     3     3     5     5     5     6     7     7    10
Test score, y    96    85    82    74    95    68    76    84    58    65    75    50
X <- c(0, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 10)
Y <- c(96, 85, 82, 74, 95, 68, 76, 84, 58, 65, 75, 50)
cor.test(X, Y)
# The test score (Y) is the response and hours (X) is the predictor
G <- lm(Y ~ X)
# R is case-sensitive: the function is summary, not Summary
summary(G)
RStudio
> Count <- c(9, 25, 15, 2, 14, 25, 24, 47)
> Count
[1]  9 25 15  2 14 25 24 47
> Speed <- c(2, 3, 5, 9, 14, 24, 29, 34)
> G <- lm(Count ~ Speed)
> summary(G)

Call:
lm(formula = Count ~ Speed)

Residuals:
    Min      1Q  Median      3Q     Max
-13.377  -5.801  -1.542   5.051  14.371

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   8.2546     5.8531   1.410   0.2081
Speed         0.7914     0.3081   2.569   0.0424 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.16 on 6 degrees of freedom
Multiple R-squared:  0.5238,  Adjusted R-squared:  0.4444
F-statistic: 6.599 on 1 and 6 DF,  p-value: 0.0424
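The quantities printed by summary can also be extracted programmatically, which is convenient when many models are fitted. The sketch below refits the same Count and Speed model; the object names s, slope, slope_p, and pred are illustrative, and Speed = 20 is a hypothetical value chosen only to show predict.

```r
# Extracting results from a fitted model instead of reading the printout.
Count <- c(9, 25, 15, 2, 14, 25, 24, 47)
Speed <- c(2, 3, 5, 9, 14, 24, 29, 34)

G <- lm(Count ~ Speed)
s <- summary(G)

slope     <- coef(G)[["Speed"]]                    # estimated slope
slope_p   <- s$coefficients["Speed", "Pr(>|t|)"]   # p-value for H0: slope = 0
r_squared <- s$r.squared                           # coefficient of determination

# Speed = 20 is a hypothetical value used only for illustration
pred <- predict(G, newdata = data.frame(Speed = 20))

print(round(c(slope = slope, p = slope_p, R2 = r_squared), 4))
print(pred)
```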