Correlation and Regression Overview


Introduction to Statistical Methods for Measuring "Omics" and Field Data

Overview
• Correlation
• Simple Linear Regression

Correlation

General Overview of Correlational Analysis
• The purpose is to measure the strength of a linear relationship between 2 variables.
• A correlation coefficient does not establish "causation" (i.e., that a change in X causes a change in Y).
• X is typically the input, measured, or independent variable.
• Y is typically the output, predicted, or dependent variable.
• If X increases and there is a predictable shift in the values of Y, a correlation exists.

General Properties of Correlation Coefficients
• Values can range between +1 and -1.
• The value of the correlation coefficient reflects the scatter of points on a scatterplot.
• You should be able to look at a scatterplot and estimate what the correlation would be.
• You should be able to look at a correlation coefficient and visualize the scatterplot.

Interpretation
• Depends on the purpose of the study, but as a general guideline:
• Value = magnitude of the relationship.
• Sign = direction of the relationship.

[Figure: four scatterplots contrasting strong and weak relationships for positive and negative correlations.]

The Pearson Correlation Coefficient
The correlation coefficient is a measure of the strength and the direction of a linear relationship between two variables. The symbol r represents the sample correlation coefficient. The formula for r is

  r = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{n \sum x^2 - (\sum x)^2}\,\sqrt{n \sum y^2 - (\sum y)^2}}

The range of the correlation coefficient is -1 to 1. If x and y have a strong positive linear correlation, r is close to 1. If x and y have a strong negative linear correlation, r is close to -1. If there is no linear correlation or only a weak one, r is close to 0.

Calculating a Correlation Coefficient
1. Find the sum of the x-values: \sum x
2. Find the sum of the y-values: \sum y
3. Multiply each x-value by its corresponding y-value and find the sum: \sum xy
4. Square each x-value and find the sum: \sum x^2
5. Square each y-value and find the sum: \sum y^2
6. Use these five sums to calculate the correlation coefficient with the formula above.

Example: Calculate the correlation coefficient r for the following data.

  x    y    xy   x^2   y^2
  1   -3    -3     1     9
  2   -1    -2     4     1
  3    0     0     9     0
  4    1     4    16     1
  5    2    10    25     4

  \sum x = 15,  \sum y = -1,  \sum xy = 9,  \sum x^2 = 55,  \sum y^2 = 15

  r = \frac{5(9) - (15)(-1)}{\sqrt{5(55) - (15)^2}\,\sqrt{5(15) - (-1)^2}} = \frac{60}{\sqrt{50}\,\sqrt{74}} \approx 0.986

There is a strong positive linear correlation between x and y.

Significance Test for Correlation
• Hypotheses: H0: ρ = 0 (no correlation); HA: ρ ≠ 0 (correlation exists).
• Test statistic, with n - 2 degrees of freedom:

  t = \frac{r \sqrt{n-2}}{\sqrt{1-r^2}}

R/RStudio
The function cor.test is used for calculating the correlation:

  X <- c(1, 2, 3, 4, 5)
  Y <- c(-3, -1, 0, 1, 2)
  cor.test(X, Y)
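To connect the hand calculation to the R output, the five sums can be assembled into r directly, and the t statistic can then be formed from r and n. The following is a minimal base-R sketch (the variable names are mine, not from the slides); cor.test(X, Y) reports the same t, degrees of freedom, and p-value:

  # Hand computation of r from the five sums in the example above
  x <- c(1, 2, 3, 4, 5)
  y <- c(-3, -1, 0, 1, 2)
  n <- length(x)

  num <- n * sum(x * y) - sum(x) * sum(y)    # 5(9) - (15)(-1) = 60
  den <- sqrt(n * sum(x^2) - sum(x)^2) *
         sqrt(n * sum(y^2) - sum(y)^2)       # sqrt(50) * sqrt(74)
  r <- num / den                             # approx. 0.986

  # t statistic for H0: rho = 0, with n - 2 degrees of freedom
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  c(r = r, t = t_stat)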
Linear Regression

Linear regression deals with the relationship between two variables X and Y.
• Y is the variable whose "behavior" we wish to study (e.g., fuel efficiency in a car).
• X is the variable we believe would help explain the behavior of Y (e.g., the size of the car).

Regression Model
• The simple linear regression model is

  Y = \beta_0 + \beta_1 x + \varepsilon

Regression Hypothesis
• H0: \beta_1 = 0 versus HA: \beta_1 ≠ 0. The t statistic for the hypothesis test is t = b_1 / SE(b_1), with n - 2 degrees of freedom.

Components of the Model
• \beta_0 is the intercept, \beta_1 is the slope, and \varepsilon is the random error.

Regression Line
• A regression line, also called a line of best fit, is the line for which the sum of the squares of the residuals is a minimum.

The Equation of a Regression Line
The equation of a regression line for an independent variable x and a dependent variable y is

  \hat{y} = mx + b

where \hat{y} is the predicted y-value for a given x-value. The slope m and y-intercept b are given by

  m = \frac{n \sum xy - (\sum x)(\sum y)}{n \sum x^2 - (\sum x)^2} \qquad \text{and} \qquad b = \bar{y} - m\bar{x} = \frac{\sum y}{n} - m \frac{\sum x}{n}

where \bar{y} is the mean of the y-values and \bar{x} is the mean of the x-values. The regression line always passes through (\bar{x}, \bar{y}).

Example: Find the equation of the regression line.

  x    y    xy   x^2   y^2
  1   -3    -3     1     9
  2   -1    -2     4     1
  3    0     0     9     0
  4    1     4    16     1
  5    2    10    25     4

  \sum x = 15,  \sum y = -1,  \sum xy = 9,  \sum x^2 = 55,  \sum y^2 = 15

  m = \frac{5(9) - (15)(-1)}{5(55) - (15)^2} = \frac{60}{50} = 1.2

  b = \bar{y} - m\bar{x} = -\frac{1}{5} - 1.2(3) = -3.8

so the equation of the regression line is \hat{y} = 1.2x - 3.8.

Example: The following data represent the number of hours 12 different students watched television during the weekend and the scores of each student on a test the following Monday.
a.) Find the equation of the regression line.
b.) Use the equation to find the expected test score for a student who watches 9 hours of TV.

  Hours, x          0     1     2     3     3     5     5     5     6     7     7    10
  Test score, y    96    85    82    74    95    68    76    84    58    65    75    50
  xy                0    85   164   222   285   340   380   420   348   455   525   500
  x^2               0     1     4     9     9    25    25    25    36    49    49   100
  y^2            9216  7225  6724  5476  9025  4624  5776  7056  3364  4225  5625  2500

  \sum x = 54,  \sum y = 908,  \sum xy = 3724,  \sum x^2 = 332,  \sum y^2 = 70836

  m = \frac{12(3724) - (54)(908)}{12(332) - (54)^2} = \frac{-4344}{1068} \approx -4.07

  b = \bar{y} - m\bar{x} = \frac{908}{12} - (-4.07)\left(\frac{54}{12}\right) \approx 93.97

Using the equation \hat{y} = -4.07x + 93.97, we can predict the test score for a student who watches 9 hours of TV:

  \hat{y} = -4.07(9) + 93.97 = 57.34

A student who watches 9 hours of TV over the weekend can expect to receive about a 57.34 on Monday's test.

Variation About a Regression Line
The total variation about a regression line is the sum of the squares of the differences between the y-value of each ordered pair and the mean of y:

  \text{Total variation} = \sum (y_i - \bar{y})^2

The explained variation is the sum of the squares of the differences between each predicted y-value and the mean of y:

  \text{Explained variation} = \sum (\hat{y}_i - \bar{y})^2

The unexplained variation is the sum of the squares of the differences between the y-value of each ordered pair and the corresponding predicted y-value:

  \text{Unexplained variation} = \sum (y_i - \hat{y}_i)^2

These satisfy: Total variation = Explained variation + Unexplained variation.
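The sums above plug directly into the slope and intercept formulas, and the fitted line in turn yields the three variation components. Here is a base-R sketch (mine, not from the slides) that reproduces the TV-hours results; the final ratio anticipates the coefficient of determination defined next:

  # Slope, intercept, prediction, and variation decomposition (TV-hours data)
  x <- c(0, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 10)
  y <- c(96, 85, 82, 74, 95, 68, 76, 84, 58, 65, 75, 50)
  n <- length(x)

  m <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
  b <- mean(y) - m * mean(x)        # the line passes through (x-bar, y-bar)
  c(slope = m, intercept = b)       # approx. -4.07 and 93.97
  b + m * 9                         # predicted score for 9 hours, approx. 57.3

  y_hat <- b + m * x                # predicted values
  explained   <- sum((y_hat - mean(y))^2)
  unexplained <- sum((y - y_hat)^2)
  total       <- sum((y - mean(y))^2)
  all.equal(total, explained + unexplained)   # TRUE: the decomposition holds
  explained / total                 # explained over total variation, approx. 0.69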
Coefficient of Determination
• The coefficient of determination R^2 is the ratio of the explained variation to the total variation:

  R^2 = \frac{\text{Explained variation}}{\text{Total variation}}

Example: The correlation coefficient for the data representing the number of hours students watched television and their test scores is r ≈ -0.831. Find the coefficient of determination.

  R^2 \approx (-0.831)^2 \approx 0.691

About 69.1% of the variation in the test scores can be explained by the variation in the hours of TV watched. About 30.9% of the variation is unexplained.

RStudio
• The function cor.test is used to calculate the correlation r and its t statistic.
• The function lm is used to calculate the regression.
• Example:

  Hours, x          0   1   2   3   3   5   5   5   6   7   7  10
  Test score, y    96  85  82  74  95  68  76  84  58  65  75  50

  X <- c(0, 1, 2, 3, 3, 5, 5, 5, 6, 7, 7, 10)
  Y <- c(96, 85, 82, 74, 95, 68, 76, 84, 58, 65, 75, 50)
  cor.test(X, Y)
  G <- lm(Y ~ X)   # the test score Y is the dependent variable
  summary(G)

RStudio
A second example, entered at the R console:

  > Count <- c(9, 25, 15, 2, 14, 25, 24, 47)
  > Count
  [1]  9 25 15  2 14 25 24 47
  > Speed <- c(2, 3, 5, 9, 14, 24, 29, 34)
  > G <- lm(Count ~ Speed)
  > summary(G)

  Call:
  lm(formula = Count ~ Speed)

  Residuals:
      Min      1Q  Median      3Q     Max
  -13.377  -5.801  -1.542   5.051  14.371

  Coefficients:
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)   8.2546     5.8531   1.410   0.2081
  Speed         0.7914     0.3081   2.569   0.0424 *
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
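To read this output: the t value for Speed is its estimate divided by its standard error, and in simple linear regression the model R^2 equals the squared correlation between the two variables. A short sketch of this check (my own, not from the slides):

  # Checking the summary(G) output by hand
  Count <- c(9, 25, 15, 2, 14, 25, 24, 47)
  Speed <- c(2, 3, 5, 9, 14, 24, 29, 34)
  G <- lm(Count ~ Speed)

  coef(G)                      # intercept 8.2546, slope 0.7914
  0.7914 / 0.3081              # reproduces the t value, approx. 2.569
  summary(G)$r.squared         # R-squared of the fit
  cor(Count, Speed)^2          # the same value: r^2 equals R^2 here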