Detecting Trends Using Spearman's Rank Correlation Coefficient
Environmental Forensics (2001) 2, 359–362
doi:10.1006/enfo.2001.0061, available online at http://www.idealibrary.com

Thomas D. Gauthier
Sciences International Inc, 6150 State Road 70, East Bradenton, FL 34203, U.S.A.

(Received 23 August 2001, Revised manuscript accepted 1 October 2001)

Spearman's rank correlation coefficient is a useful tool for exploratory data analysis in environmental forensic investigations. In this application it is used to detect monotonic trends in chemical concentration with time or space. © 2001 AEHS

Keywords: Spearman rank correlation coefficient; trend analysis; soil; groundwater.

1527-5922/01/040359+04 $35.00/00 © 2001 AEHS

Environmental forensic investigations can involve a variety of statistical analysis techniques. A relatively simple technique that can be used for exploratory data analysis is the Spearman rank correlation coefficient. In this technical note, I describe how the Spearman rank correlation coefficient can be used as a statistical tool to detect monotonic trends in chemical concentrations with time or space.

Trend analysis can be useful for detecting chemical "footprints" at a site or evaluating the effectiveness of an installed or natural attenuation remedy. Just as ratios of chemical concentrations in a single sample can be viewed as a chemical "fingerprint," the overall pattern created by a single chemical detected at multiple locations at a site can be viewed as a chemical "footprint." Point sources of contamination often leave footprints at a site that can be detected using the Spearman rank correlation coefficient. The footprints are created by advection and dispersion processes that lead to decreasing concentrations with distance from the source. The Spearman rank correlation coefficient can also be used to determine if concentrations are increasing or decreasing with time at a given well or sampling location to monitor water quality or evaluate the effectiveness of a remedy.

The Spearman rank correlation coefficient is a nonparametric technique for evaluating the degree of linear association or correlation between two independent variables. It is similar to Pearson's product moment correlation coefficient except that it operates on the ranks of the data rather than the raw data.

There are advantages to using Spearman's rank correlation coefficient over the more common product moment correlation coefficient. It is a nonparametric technique, so it is unaffected by the distribution of the population. Because the technique operates on the ranks of the data, it is relatively insensitive to outliers and there is no requirement that the data be collected over regularly spaced intervals. It can be used with very small sample sizes and it is easy to apply. The disadvantage is that there is a loss of information when the data are converted to ranks and, if the data are normally distributed, it is less powerful than the Pearson correlation coefficient.

The idea behind the rank correlation coefficient is simple. Each variable is ranked separately from lowest to highest (e.g. 1, 2, 3, etc.) and the difference between ranks for each data pair is recorded. If the data are positively correlated, then the sum of the squared differences between ranks will be small. The magnitude of the sum is related to the significance of the correlation.

The Spearman rank correlation coefficient is calculated according to the following equation:

$$ r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n^3 - n} \tag{1} $$

where $d_i$ is the difference between ranks for each $x_i$, $y_i$ data pair and $n$ is the number of data pairs. If ties are involved, the equation is more complicated but only makes an appreciable difference if there are a large number of ties (Zar, 1984).

$$ r_s = \frac{\dfrac{n^3 - n}{6} - \sum_{i=1}^{n} d_i^2 - \sum T_x - \sum T_y}{\sqrt{\left[\dfrac{n^3 - n}{6} - 2 \sum T_x\right] \left[\dfrac{n^3 - n}{6} - 2 \sum T_y\right]}} \tag{2} $$

where $g$ is the number of tied groups, $t_j$ is the number of tied data in the $j$th group, and

$$ \sum T_x = \frac{\sum_{j=1}^{g} \left(t_j^3 - t_j\right)}{12} \quad \text{for } x \text{ values} \tag{3} $$

and

$$ \sum T_y = \frac{\sum_{j=1}^{g} \left(t_j^3 - t_j\right)}{12} \quad \text{for } y \text{ values} \tag{4} $$

Equation (1) is constructed so that it gives $r_s = 1$ when the data pairs have a perfect positive correlation ($d_i = 0$) and $r_s = -1$ for a perfect negative correlation. For example, with $n = 5$, a perfect negative correlation yields the following values for $d_i$: 4, 2, 0, -2, and -4. The sum of the squares of these values is 40 and the rank correlation coefficient is

$$ r_s = 1 - \frac{6 \cdot 40}{125 - 5} = -1 \tag{5} $$

In order to test the significance of a correlation, we assume that there is no correlation between the two variables and then determine the probability of getting a correlation coefficient as large or larger (or as small or smaller for a negative correlation) than the calculated value by chance. This is done by comparing the calculated value to a table of critical values. If the absolute value of the calculated value is larger than the critical value, then the correlation is significant. A partial table of critical values is presented in Table 1. Alternatively, if the number of data pairs is large (greater than about 30), a "t statistic" can be calculated using equation (6) and then compared to a table of t values with $n - 2$ degrees of freedom as a good approximation.

$$ t = \frac{r_s \sqrt{n - 2}}{\sqrt{1 - r_s^2}} \tag{6} $$

Table 1. Table of critical values for Spearman's rank correlation coefficient

Number of     Level of significance (α), one-tailed test
data pairs    0.100    0.050    0.025    0.010
 4            1.000    1.000    -        -
 5            0.800    0.900    1.000    1.000
 6            0.657    0.829    0.886    0.943
 7            0.571    0.714    0.786    0.893
 8            0.524    0.643    0.738    0.833
 9            0.483    0.600    0.700    0.783
10            0.455    0.564    0.648    0.745
11            0.427    0.536    0.618    0.709
12            0.406    0.503    0.587    0.671
13            0.385    0.484    0.560    0.648
14            0.367    0.464    0.538    0.622
15            0.354    0.443    0.521    0.604
16            0.341    0.429    0.503    0.582
17            0.328    0.414    0.485    0.566
18            0.317    0.401    0.472    0.550
19            0.309    0.391    0.460    0.535
20            0.299    0.380    0.447    0.520
21            0.292    0.370    0.435    0.508
22            0.284    0.361    0.425    0.496
23            0.278    0.353    0.415    0.486
24            0.271    0.344    0.406    0.476
25            0.265    0.337    0.398    0.466
26            0.259    0.331    0.390    0.457
27            0.255    0.324    0.382    0.448
28            0.250    0.317    0.375    0.440
29            0.245    0.312    0.368    0.433
30            0.240    0.306    0.362    0.425

In order to demonstrate the technique, consider the hypothetical data set presented in Table 2. Concentrations of lead in soil have been measured as a function of distance north of a suspected source, an onsite landfill at a former battery-breaking site. The data are plotted in Figure 1 and clearly indicate a trend of decreasing concentration with distance from the landfill. The data ranks and the squares of the differences between ranks are included in Table 2. Notice that the distance measurements are easily ranked and can be irregularly spaced. The sum of the squares of the differences between ranks is 108. The Spearman rank correlation coefficient can be calculated using equation (1) as illustrated below.

$$ r_s = 1 - \frac{6 \cdot 108}{7^3 - 7} = 1 - \frac{648}{336} = -0.928 \tag{7} $$

Table 2. Hypothetical data set 1

Distance north    Lead concentration    Distance    Concentration    Squared difference
(feet)            (ppm)                 rank        rank             between ranks
   0              510                   1           7                36
  50              380                   2           5                 9
 300              450                   3           6                 9
 800              300                   4           4                 0
1000              170                   5           3                 4
2000               45                   6           1                25
5000               89                   7           2                25

[Figure 1 plot: Concentration (ppm), 0 to 600, versus Distance north (feet), 0 to 6000.]

Figure 1. Concentration of lead as a function of distance north for hypothetical data set 1.

In order to test if our calculated value of -0.928 is significant at the 95% probability level, we can compare that value to the critical value for $n = 7$ and $\alpha = 0.05$ from Table 1. From the table we find $r_{crit} = 0.714$. Since our calculated value (neglecting the direction of the correlation) of 0.928 exceeds the critical value, we can conclude that the trend is significant at the 95% probability level.

The prior example was rather obvious. Now consider the data presented in Figure 2. The trend of decreasing concentration with distance from the source is not as obvious but appears to be present. However, the calculated rank correlation coefficient of -0.393 is

[Figure plot: vertical axis Concentration (ppm), 0 to 800; horizontal axis 0 to 6000.]

attenuation at each of the sites. The Mann-Kendall test for trend is a nonparametric test that is based on the number of times the data increase or decrease when data points are compared to points that follow in time. The technique, described by Gilbert (1987), is similar to the Spearman rank correlation coefficient in that it is nonparametric, can be used on small data sets, is relatively insensitive to outliers, and can be used with irregularly spaced data.
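The worked example for hypothetical data set 1 can be reproduced with a short script. The sketch below is an illustration only, not the author's code: it assumes no tied values (so the simple untied formula applies), and the function and variable names are hypothetical. The critical value 0.714 is the entry for seven data pairs at a one-tailed significance level of 0.050 in the table of critical values.

```python
import math


def ranks(values):
    """Rank values from lowest (1) to highest (n). Assumes no ties."""
    if len(set(values)) != len(values):
        raise ValueError("tied values require the tie-corrected formula")
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def sum_squared_rank_differences(x, y):
    """Sum of d_i^2, where d_i is the rank difference for each data pair."""
    return sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))


def spearman_rs(x, y):
    """Spearman rank correlation coefficient for untied data."""
    n = len(x)
    return 1 - 6 * sum_squared_rank_differences(x, y) / (n ** 3 - n)


def t_statistic(rs, n):
    """Large-sample t approximation with n - 2 degrees of freedom.

    The text recommends it only for more than about 30 data pairs,
    so it is defined here but not applied to the 7-point data set.
    """
    return rs * math.sqrt(n - 2) / math.sqrt(1 - rs ** 2)


# Hypothetical data set 1, copied from the text.
distance_north_ft = [0, 50, 300, 800, 1000, 2000, 5000]
lead_ppm = [510, 380, 450, 300, 170, 45, 89]

sum_d2 = sum_squared_rank_differences(distance_north_ft, lead_ppm)
rs = spearman_rs(distance_north_ft, lead_ppm)

# Critical value for n = 7, one-tailed alpha = 0.050 (from the table).
R_CRIT_N7_ALPHA_05 = 0.714
significant = abs(rs) > R_CRIT_N7_ALPHA_05

print(f"sum of squared rank differences = {sum_d2}")
print(f"r_s = {rs:.4f}")
print(f"significant at alpha = 0.05 (one-tailed): {significant}")
```

The script gives a sum of squared rank differences of 108 and a coefficient of about -0.9286, which the text reports as -0.928. Ranking by sort position keeps the sketch short, at the cost of refusing tied data rather than applying the tie correction.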