Statistical Concepts Series

Correlation and Simple Linear Regression1

Kelly H. Zou, PhD
Kemal Tuncali, MD
Stuart G. Silverman, MD

Index terms: Data analysis; Statistical analysis
Published online: 10.1148/radiol.2273011499
Radiology 2003; 227:617–628

In this tutorial article, the concepts of correlation and regression are reviewed and demonstrated. The authors review and compare two correlation coefficients, the Pearson correlation coefficient and the Spearman ρ, for measuring linear and nonlinear relationships between two continuous variables. In the case of measuring the linear relationship between a predictor and an outcome variable, simple linear regression analysis is conducted. These statistical concepts are illustrated by using a data set from published literature to assess a computed tomography–guided interventional technique. These statistical methods are important for exploring the relationships between variables and can be applied to many radiologic studies.
© RSNA, 2003

1 From the Department of Radiology, Brigham and Women's Hospital (K.H.Z., K.T., S.G.S.), and Department of Health Care Policy (K.H.Z.), Harvard Medical School, 180 Longwood Ave, Boston, MA 02115. Received September 10, 2001; revision requested October 31; revision received December 26; accepted January 21, 2002. Address correspondence to K.H.Z. (e-mail: [email protected]).

Results of clinical studies frequently yield data that are dependent on each other (eg, total procedure time versus the dose in computed tomographic [CT] fluoroscopy, signal-to-noise ratio versus number of signals acquired during magnetic resonance imaging, and cigarette smoking versus lung cancer). The statistical concepts of correlation and regression, which are used to evaluate the relationship between two continuous variables, are reviewed and demonstrated in this article.
Analyses between two variables may focus on (a) any association between the variables, (b) the value of one variable in predicting the other, and (c) the amount of agreement. Agreement will be discussed in a future article. Regression analysis focuses on the form of the relationship between variables, while the objective of correlation analysis is to gain insight into the strength of the relationship (1,2). Note that these two techniques are used to investigate relationships between continuous variables, whereas the χ² test is an example of a test for association between categorical variables. Continuous variables, such as procedure time, patient age, and number of lesions, have no gaps on the measurement scale. In contrast, categorical variables, such as patient sex and tissue classification based on segmentation, have gaps in their possible values. These two types of variables and the assumptions about their measurement scales were reviewed and distinguished in an article by Applegate and Crewson (3) published earlier in this Statistical Concepts Series in Radiology.

Specifically, the topics covered herein include two commonly used correlation coefficients, the Pearson correlation coefficient (4,5) and the Spearman ρ (6–10), for measuring linear and nonlinear relationships, respectively, between two continuous variables. Correlation analysis is often conducted in a retrospective or observational study. In a clinical trial, on the other hand, the investigator may also wish to manipulate the values of one variable and assess the changes in values of another variable. To evaluate the relative impact of the predictor variable on the particular outcome, simple regression analysis is preferred. We illustrate these statistical concepts with existing data to assess patient skin dose based on total procedure time by using a quick-check method in CT fluoroscopy–guided abdominal interventions (11).
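The distinction drawn above between the linear (Pearson) and rank (Spearman) coefficients can be previewed with a small numeric example. The following is a minimal pure-Python sketch for illustration only; the function names are ours, it assumes untied data for ranking, and a real analysis would normally use a statistical package.

```python
import math

def pearson_r(x, y):
    # Pearson product-moment correlation: the sum of cross-products of the
    # deviations from the sample means, scaled by both sums of squares.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simple_ranks(values):
    # Rank from 1 (smallest) to n (largest); assumes no ties for simplicity.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def spearman_rho(x, y):
    # The Spearman rho is the Pearson r computed on the ranks of the data.
    return pearson_r(simple_ranks(x), simple_ranks(y))

# Monotonic but nonlinear data: x versus x cubed.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [v ** 3 for v in x]
# Here the Spearman rho equals 1 (perfect monotonic agreement), while the
# Pearson r falls below 1 because the relationship is not linear.
```

This mirrors the roles the text assigns to the two coefficients: the Pearson r rewards only linearity, whereas the Spearman ρ rewards any monotonic relationship.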
These statistical methods are useful tools for assessing the relationships between continuous variables collected from a clinical study. However, it is also important to distinguish these statistical methods: while they are similar mathematically, their purposes are different. Correlation analysis is generally overused. It is often interpreted incorrectly (to establish "causation") and should be reserved for generating hypotheses rather than for testing them. On the other hand, regression modeling is a more useful statistical technique that allows us to assess the strength of the relationships in the data and the uncertainty in the model by using confidence intervals (12,13).

Figure 1. Scatterplots of four sets of data generated by means of the following Pearson correlation coefficients (from left to right): r = 0 (uncorrelated data), r = 0.8 (strongly positively correlated), r = 1.0 (perfectly positively correlated), and r = −1.0 (perfectly negatively correlated).

CORRELATION

The purpose of correlation analysis is to measure and interpret the strength of a linear or nonlinear (eg, exponential, polynomial, and logistic) relationship between two continuous variables. When conducting correlation analysis, we use the term association to mean "linear association" (1,2). Herein, we focus on the Pearson and Spearman correlation coefficients. Both correlation coefficients take on values between −1 and +1, ranging from being negatively correlated (−1) to uncorrelated (0) to positively correlated (+1). The sign of the correlation coefficient (ie, positive or negative) defines the direction of the relationship. The absolute value indicates the strength of the correlation (Table 1, Fig 1). We elaborate on two correlation coefficients, linear (eg, Pearson) and rank (eg, Spearman), that are commonly used for measuring linear and general relationships between two variables.

TABLE 1
Interpretation of Correlation Coefficient

Correlation Coefficient Value    Direction and Strength of Correlation
−1.0                             Perfectly negative
−0.8                             Strongly negative
−0.5                             Moderately negative
−0.2                             Weakly negative
 0.0                             No association
+0.2                             Weakly positive
+0.5                             Moderately positive
+0.8                             Strongly positive
+1.0                             Perfectly positive

Note.—The sign of the correlation coefficient (ie, positive or negative) defines the direction of the relationship. The absolute value indicates the strength of the correlation.

Linear Correlation

The Pearson correlation coefficient is also known as the sample correlation coefficient (r), product-moment correlation coefficient, or coefficient of correlation (14). It was introduced by Galton in 1877 (15,16) and developed later by Pearson (17). It measures the linear relationship between two random variables. For example, when the value of the predictor is manipulated (increased or decreased) by a fixed amount, the outcome variable changes proportionally (linearly). A linear correlation coefficient can be computed by means of the data and their sample means (Appendix A). When a scientific study is planned, the required sample size may be computed on the basis of a certain hypothesized value with the desired statistical power at a specified level of significance (Appendix B) (18).

Rank Correlation

The Spearman ρ is the sample correlation coefficient (r_s) of the ranks (the relative order) based on continuous data (19,20). It was first introduced by Spearman in 1904 (6). The Spearman ρ is used to measure the monotonic relationship between two variables (ie, whether one variable tends to take either a larger or smaller value, though not necessarily linearly, by increasing the value of the other variable).

Linear versus Rank Correlation Coefficients

The Pearson correlation coefficient necessitates use of interval or continuous measurement scales of the measured outcome in the study population. In contrast, rank correlations also work well with ordinal rating data, and continuous data are reduced to their ranks (Appendix C) (20,21). The rank procedure will also be illustrated briefly with our example data. The smallest value in the sample has rank 1, and the largest has the highest rank. In general, rank correlations are not easily influenced by the presence of skewed data or data that are highly variable.

Limitations and Precautions

It is worth noting that even if two variables (eg, cigarette smoking and lung cancer) are highly correlated, correlation is not sufficient proof of causation. One variable may cause the other or vice versa, a third factor may be involved, or a rare event may have occurred. To conclude causation, the causal variable must precede the variable it causes, and several conditions must be met (eg, reversibility, strength, and exposure response on the basis of the Bradford-Hill criteria or the Rubin causal model) (23–26).

Statistical Hypothesis Tests for a Correlation Coefficient

The null hypothesis states that the underlying linear correlation has a hypothesized value, ρ0. The one-sided alternative hypothesis is that the underlying value exceeds (or is less than) ρ0. When the sample size (n) of the paired data is large (n ≥ 30 for each variable), the standard error (s) of the linear correlation coefficient (r) is approximately s(r) = (1 − r²)/√n. The test statistic value (r − ρ0)/s(r) may be computed by means of the z test (22). If the P value is below .05, the null hypothesis is rejected. The P value based on the Spearman ρ can be found in the literature (20,21).

SIMPLE LINEAR REGRESSION

The purpose of simple regression analysis is to evaluate the relative impact of a predictor variable on a particular outcome. This is different from a correlation analysis, where the purpose is to examine the strength and direction of the relationship. … the relationship for prediction and estimation, and (d) assess whether the data fit these criteria before the equation is applied for prediction and estimation.

… performed. (a) To understand whether the assumptions have been met, determine the magnitude of the gap between the data and the assumptions of the model. (b) No matter how strong a relationship is demonstrated with regression analysis, it should not be interpreted as causation (as in the correlation analysis). (c) The regression should not be used to …

Least Squares Method

The main goal of linear regression is to fit a straight line through the data that predicts Y based on X.
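The large-sample test statistic and the least squares fit described above can be sketched in a few lines of code. This is a minimal pure-Python illustration, not the authors' software: the function names are ours, the z statistic uses the approximation s(r) = (1 − r²)/√n quoted in the text, and a real analysis would normally use statistical software.

```python
import math

def least_squares_line(x, y):
    # Fit the line y = a + b*x by least squares, ie, by minimizing the sum
    # of squared vertical distances between the data and the line.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    sxy = sum((v - mx) * (w - my) for v, w in zip(x, y))
    b = sxy / sxx          # slope
    a = my - b * mx        # intercept
    return a, b

def z_statistic(r, n, rho0=0.0):
    # Large-sample (n >= 30) z test of H0: rho = rho0, with standard error
    # s(r) approximated by (1 - r**2) / sqrt(n).
    s_r = (1.0 - r ** 2) / math.sqrt(n)
    return (r - rho0) / s_r

# Example: perfectly linear data recover intercept 1 and slope 2 exactly.
a, b = least_squares_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

The fitted line passes through the point of the sample means (mx, my), which is why the intercept falls out as a = my − b·mx once the slope is known.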