Environmental Forensics (2001) 2, 359–362, doi:10.1006/enfo.2001.0061, available online at http://www.idealibrary.com
Detecting Trends Using Spearman's Rank Correlation Coefficient
Thomas D. Gauthier
Sciences International Inc, 6150 State Road 70, East Bradenton, FL 34203, U.S.A.
(Received 23 August 2001, Revised manuscript accepted 1 October 2001)
Spearman's rank correlation coefficient is a useful tool for exploratory data analysis in environmental forensic investigations. In this application it is used to detect monotonic trends in chemical concentration with time or space. © 2001 AEHS
Keywords: Spearman rank correlation coefficient; trend analysis; soil; groundwater.
1527-5922/01/040359+04 $35.00/00 © 2001 AEHS

Environmental forensic investigations can involve a variety of statistical analysis techniques. A relatively simple technique that can be used for exploratory data analysis is the Spearman rank correlation coefficient. In this technical note, I describe how the Spearman rank correlation coefficient can be used as a statistical tool to detect monotonic trends in chemical concentrations with time or space.

Trend analysis can be useful for detecting chemical "footprints" at a site or evaluating the effectiveness of an installed or natural attenuation remedy. Just as ratios of chemical concentrations in a single sample can be viewed as a chemical "fingerprint," the overall pattern created by a single chemical detected at multiple locations at a site can be viewed as a chemical "footprint." Point sources of contamination often leave footprints at a site that can be detected using the Spearman rank correlation coefficient. The footprints are created by advection and dispersion processes that lead to decreasing concentrations with distance from the source. The Spearman rank correlation coefficient can also be used to determine whether concentrations are increasing or decreasing with time at a given well or sampling location, either to monitor water quality or to evaluate the effectiveness of a remedy.

The Spearman rank correlation coefficient is a nonparametric technique for evaluating the degree of monotonic association or correlation between two variables. It is similar to Pearson's product moment correlation coefficient except that it operates on the ranks of the data rather than the raw data.

There are advantages to using Spearman's rank correlation coefficient over the more common product moment correlation coefficient. It is a nonparametric technique, so it is unaffected by the distribution of the population. Because the technique operates on the ranks of the data, it is relatively insensitive to outliers and there is no requirement that the data be collected over regularly spaced intervals. It can be used with very small sample sizes and it is easy to apply. The disadvantage is that there is a loss of information when the data are converted to ranks and, if the data are normally distributed, it is less powerful than the Pearson correlation coefficient.

The idea behind the rank correlation coefficient is simple. Each variable is ranked separately from lowest to highest (e.g. 1, 2, 3, etc.) and the difference between ranks for each data pair is recorded. If the data are positively correlated, then the sum of the squares of the differences between ranks will be small. The magnitude of the sum is related to the significance of the correlation.

The Spearman rank correlation coefficient is calculated according to the following equation:

$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n^3 - n} \tag{1}$$

where $d_i$ is the difference between ranks for each $x_i$, $y_i$ data pair and $n$ is the number of data pairs. If ties are involved, the equation is more complicated, but the correction only makes an appreciable difference if there are a large number of ties (Zar, 1984):

$$r_s = \frac{\dfrac{n^3 - n}{6} - \sum_{i=1}^{n} d_i^2 - \sum T_x - \sum T_y}{\sqrt{\left[\dfrac{n^3 - n}{6} - 2\sum T_x\right]\left[\dfrac{n^3 - n}{6} - 2\sum T_y\right]}} \tag{2}$$

where $g$ is the number of tied groups, $t_j$ is the number of tied data in the $j$th group, and

$$\sum T_x = \frac{\sum_{j=1}^{g}\left(t_j^3 - t_j\right)}{12} \quad \text{for } x \text{ values} \tag{3}$$

and

$$\sum T_y = \frac{\sum_{j=1}^{g}\left(t_j^3 - t_j\right)}{12} \quad \text{for } y \text{ values} \tag{4}$$

Equation (1) is constructed so that it gives $r_s = 1$ when the data pairs have a perfect positive correlation ($d_i = 0$) and $r_s = -1$ for a perfect negative correlation. For example, with $n = 5$, a perfect negative correlation yields the following values for $d_i$: 4, 2, 0, −2, and −4.
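The rank differences in this example, and Equations (1) to (4) more generally, can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, and tied values are assumed to share the average of the ranks they span (the usual midrank convention, which the text does not spell out).

```python
import math
from collections import Counter


def midranks(values):
    """Rank values from lowest (1) to highest (n).

    Assumption: tied values receive the average of the ranks they occupy.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def tie_term(values):
    """Sum of (t^3 - t) / 12 over groups of tied values, Equations (3) and (4)."""
    return sum((t ** 3 - t) / 12 for t in Counter(values).values())


def spearman_simple(x, y):
    """Equation (1): appropriate when there are no (or very few) ties."""
    n = len(x)
    sum_d2 = sum((ry - rx) ** 2 for rx, ry in zip(midranks(x), midranks(y)))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


def spearman_tie_corrected(x, y):
    """Equation (2): tie-corrected form; reduces to Equation (1) without ties."""
    n = len(x)
    sum_d2 = sum((ry - rx) ** 2 for rx, ry in zip(midranks(x), midranks(y)))
    base = (n ** 3 - n) / 6
    tx, ty = tie_term(x), tie_term(y)
    return (base - sum_d2 - tx - ty) / math.sqrt((base - 2 * tx) * (base - 2 * ty))


# Perfect negative correlation with n = 5 (made-up values for illustration).
x = [1, 2, 3, 4, 5]
y = [50, 40, 30, 20, 10]
d = [ry - rx for rx, ry in zip(midranks(x), midranks(y))]
print("rank differences:", d)
print("r_s, Equation (1):", spearman_simple(x, y))
print("r_s, Equation (2):", spearman_tie_corrected(x, y))
```

With no ties, both tie terms are zero and the two functions agree, which is a quick sanity check on the reconstruction of Equation (2).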
The sum of the squares of these values is 40 and the rank correlation coefficient is

$$r_s = 1 - \frac{6 \cdot 40}{125 - 5} = -1 \tag{5}$$

In order to test the significance of a correlation, we assume that there is no correlation between the two variables and then determine the probability of getting a correlation coefficient as large or larger (or as small or smaller for a negative correlation) than the calculated value by chance. This is done by comparing the calculated value to a table of critical values. If the absolute value of the calculated value is larger than the critical value, then the correlation is significant. A partial table of critical values is presented in Table 1. Alternatively, if the number of data pairs is large (greater than about 30), a "t statistic" can be calculated using Equation (6) and then compared to a table of t values with $n - 2$ degrees of freedom as a good approximation.

$$t = \frac{r_s\sqrt{n - 2}}{\sqrt{1 - r_s^2}} \tag{6}$$

Table 1. Critical values for Spearman's rank correlation coefficient

Number of      Level of significance (α), one-tailed test
data pairs     0.100    0.050    0.025    0.010
 4             1.000    1.000    -        -
 5             0.800    0.900    1.000    1.000
 6             0.657    0.829    0.886    0.943
 7             0.571    0.714    0.786    0.893
 8             0.524    0.643    0.738    0.833
 9             0.483    0.600    0.700    0.783
10             0.455    0.564    0.648    0.745
11             0.427    0.536    0.618    0.709
12             0.406    0.503    0.587    0.671
13             0.385    0.484    0.560    0.648
14             0.367    0.464    0.538    0.622
15             0.354    0.443    0.521    0.604
16             0.341    0.429    0.503    0.582
17             0.328    0.414    0.485    0.566
18             0.317    0.401    0.472    0.550
19             0.309    0.391    0.460    0.535
20             0.299    0.380    0.447    0.520
21             0.292    0.370    0.435    0.508
22             0.284    0.361    0.425    0.496
23             0.278    0.353    0.415    0.486
24             0.271    0.344    0.406    0.476
25             0.265    0.337    0.398    0.466
26             0.259    0.331    0.390    0.457
27             0.255    0.324    0.382    0.448
28             0.250    0.317    0.375    0.440
29             0.245    0.312    0.368    0.433
30             0.240    0.306    0.362    0.425

In order to demonstrate the technique, consider the hypothetical data set presented in Table 2. Concentrations of lead in soil have been measured as a function of distance north of a suspected source, an onsite landfill at a former battery-breaking site. The data are plotted in Figure 1 and clearly indicate a trend of decreasing concentration with distance from the landfill. The data ranks and the squares of the differences between ranks are included in Table 2. Notice that the distance measurements are easily ranked and can be irregularly spaced. The sum of the squares of the differences between ranks is 108. The Spearman rank correlation coefficient can be calculated using Equation (1):

$$r_s = 1 - \frac{6(108)}{7^3 - 7} = 1 - \frac{648}{336} = -0.929 \tag{7}$$

Table 2. Hypothetical data set 1

Distance north   Lead concentration   Distance   Concentration   Squared difference
(feet)           (ppm)                rank       rank            between ranks
   0             510                  1          7               36
  50             380                  2          5                9
 300             450                  3          6                9
 800             300                  4          4                0
1000             170                  5          3                4
2000              45                  6          1               25
5000              89                  7          2               25

Figure 1. Concentration of lead as a function of distance north for hypothetical data set 1. (Scatter plot; x axis: distance north in feet, y axis: concentration in ppm.)

In order to test if our calculated value of −0.929 is significant at the 95% probability level, we can compare that value to the critical value for $n = 7$ and $\alpha = 0.05$ from Table 1. From the table we find $r_{crit} = 0.714$. Since our calculated value (neglecting the direction of the correlation) of 0.929 exceeds the critical value, we can conclude that the trend is significant at the 95% probability level.

The prior example was rather obvious. Now consider the data presented in Figure 2. The trend of decreasing concentration with distance from the source is not as obvious but appears to be present. However, the calculated rank correlation coefficient of −0.393 is not significant at the 95% probability level. There is sufficient scatter in the data to suggest that the observed spatial distribution of the lead concentrations could have occurred by chance. Of course, proper interpretation of the results should include other considerations such as soil type, species of lead, and presence of other metals. At the least, the results of the rank correlation test should prompt a closer inspection of the data.

Figure 2. Concentration of lead as a function of distance north for hypothetical data set 2. (Scatter plot; x axis: distance north in feet, y axis: concentration in ppm.)

In this final example, the Spearman rank correlation coefficient is applied to a series of data reported by Robb and Moyer (2001). In this recent paper, Robb and Moyer describe use of the Mann–Kendall test for evaluating trends in benzene and MTBE concentrations with time at four Midwestern retail gasoline outlets in order to evaluate the effectiveness of natural attenuation at each of the sites. The Mann–Kendall test for trend is a nonparametric test that is based on the number of times the data increase or decrease when data points are compared to points that follow in time. The technique, described by Gilbert (1987), is similar to the Spearman rank correlation coefficient in that it is nonparametric, can be used on small data sets, is relatively insensitive to outliers, and can be used with irregularly spaced data. One advantage of the Spearman rank correlation coefficient over the Mann–Kendall test is that it is easier to calculate, particularly for larger data sets.

MTBE concentrations recorded quarterly and then semi-annually at two wells (MW-1 and MW-4) over a period of about 3.5 years are reported in Table 3 and plotted on a relative concentration scale in Figure 3. Using the Mann–Kendall test, the authors reported a decreasing trend in MTBE concentration over time, at a 90% confidence level, for MW-1 but not MW-4. Although absolute concentrations vary significantly between MW-1 and MW-4, relative changes in concentration over time are similar, as shown in Figure 3. The only real difference occurs in April 1998, when the MTBE concentration spikes up from a previous value of 2 to 520 mg/L in MW-4, whereas in MW-1, MTBE concentrations remain essentially constant at about 4,800 mg/L.

Table 3. MTBE concentration data as a function of time (Robb and Moyer, 2001)

                   MTBE concentration (mg/L)
Sampling date      MW-1      MW-4
July 1997          15,000    460
October 1997       39,000    670
January 1998        4,800      2
April 1998          4,700    520
July 1998           1,500    180
October 1998        7,000    140
May 1999            4,400    110
November 1999       5,000    190
June 2000           2,200     62
November 2000       1,300    150

The Spearman rank correlation coefficients for MW-1 and MW-4 are −0.685 and −0.430, respectively. The critical value from Table 1 for $n = 10$ and $\alpha = 0.1$ is 0.455. Thus, at the 90% probability level, we conclude that the trend of decreasing concentration with time is significant in MW-1 but not in MW-4, similar to the Mann–Kendall test results reported by Robb and Moyer. Once again, it should be stressed that the Spearman rank correlation coefficient should be used as an exploratory tool in conjunction with other data, and conclusions should be based on all the data. The fact that MTBE concentrations were decreasing in two other wells in addition to MW-1 suggests that natural attenuation is occurring at the site. The fact that no significant trend was detected in MW-4 is not necessarily attributable to the spike observed in April 1998. Neglecting this data point, the rank correlation coefficient is calculated to be −0.417, which is still not significant compared to the critical value of 0.483.
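The significance check used in these examples (comparison against Table 1, or the t approximation of Equation (6) for larger samples) can be sketched as follows, using the lead data of Table 2. This is an illustrative sketch: the function names are hypothetical, the dictionary holds only a few rows copied from Table 1, and the ranking helper assumes no tied values.

```python
import math

# A few one-tailed critical values copied from Table 1: {n: {alpha: r_crit}}.
CRITICAL = {
    7: {0.100: 0.571, 0.050: 0.714, 0.025: 0.786, 0.010: 0.893},
    8: {0.100: 0.524, 0.050: 0.643, 0.025: 0.738, 0.010: 0.833},
    9: {0.100: 0.483, 0.050: 0.600, 0.025: 0.700, 0.010: 0.783},
    10: {0.100: 0.455, 0.050: 0.564, 0.025: 0.648, 0.010: 0.745},
}


def ranks(values):
    """Rank from lowest (1) to highest (n); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Equation (1) applied to the ranks of x and y."""
    n = len(x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


def is_significant(rs, n, alpha):
    """Compare |r_s| with the tabulated critical value.

    Raises KeyError if (n, alpha) is not among the rows copied above.
    """
    return abs(rs) > CRITICAL[n][alpha]


def t_statistic(rs, n):
    """Equation (6); intended for n greater than about 30, undefined for |r_s| = 1."""
    return rs * math.sqrt((n - 2) / (1 - rs ** 2))


# Table 2: distance north (feet) and lead concentration (ppm).
distance_ft = [0, 50, 300, 800, 1000, 2000, 5000]
lead_ppm = [510, 380, 450, 300, 170, 45, 89]

rs_lead = spearman(distance_ft, lead_ppm)
print(f"r_s = {rs_lead:.3f}")
print("significant at alpha = 0.05:", is_significant(rs_lead, len(lead_ppm), 0.050))
```

The resulting t value from `t_statistic` would still have to be compared with a tabulated t distribution with n − 2 degrees of freedom, which is not included in this sketch.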
Figure 3. Relative concentration of MTBE detected in wells MW-1 and MW-4 (Robb and Moyer, 2001). (Time series plot; x axis: sampling date, December 1996 to April 2001; y axis: relative concentration, 0 to 1.2.)
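The MW-1 coefficient can be reproduced from Table 3 in a few lines. This is a sketch under the assumption that sampling dates are ranked simply by calendar order, encoded here as (year, month) tuples, so the uneven spacing between quarterly and semi-annual sampling rounds does not enter the calculation. The helper names are illustrative and the ranking helper assumes no tied values.

```python
def ranks(values):
    """Rank from lowest (1) to highest (n); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Equation (1) applied to the ranks of x and y."""
    n = len(x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


# (year, month) of each sampling round and the MW-1 MTBE concentration, from Table 3.
dates = [(1997, 7), (1997, 10), (1998, 1), (1998, 4), (1998, 7),
         (1998, 10), (1999, 5), (1999, 11), (2000, 6), (2000, 11)]
mw1 = [15000, 39000, 4800, 4700, 1500, 7000, 4400, 5000, 2200, 1300]

rs_mw1 = spearman(dates, mw1)
critical = 0.455  # Table 1, n = 10, alpha = 0.10 (one-tailed)
print(f"MW-1: r_s = {rs_mw1:.3f}")
print("decreasing trend significant at alpha = 0.10:", rs_mw1 < 0 and abs(rs_mw1) > critical)
```

Because only the order of the dates matters, replacing the tuples with the integers 1 to 10 gives the same coefficient.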
What if the first two sampling dates (July 1997 and October 1997) are omitted? Then the rank correlation coefficients are calculated to be −0.405 and 0.0 for MW-1 and MW-4, respectively. The critical value for $n = 8$ and $\alpha = 0.1$ is 0.524. Thus, the decreasing trend in MW-1 is not significant at the 90% confidence level and no trend is detected in MW-4. This analysis suggests that concentrations have not decreased significantly since January 1998. Thus, in reviewing statistical analysis results, it is important to consider which data have been included in the analysis and which data have been omitted. This is always a concern when the statistical tests are designed after the data have been collected. Ideally, the statistical tests should be designed prior to any data collection. Unfortunately, in most forensic applications, such is rarely the case.

Two other issues warrant consideration when using the Spearman rank correlation coefficient for detecting trends in time series data: serial correlation and seasonal effects. A common characteristic of time series data is that the data tend to be serially correlated; that is, the value of a data point tends to be related to previous data points collected in the series even after the underlying trend and any seasonal effects are accounted for (Montgomery and Reckhow, 1984; Gilbert, 1987; Berthouex and Brown, 1994). Serial correlation, or autocorrelation, is detected by examining the residuals after modeled characteristics such as trend and seasonality are removed (Montgomery and Reckhow, 1984). Serial correlation is difficult to detect in small samples, but tests such as the Durbin–Watson test are available (Berthouex and Brown, 1994). Serial correlation in time series data tends to be an issue when the time steps are small relative to the residence time of the medium being sampled.

Seasonal variations in a data set can hinder the detection of long-term trends. According to Gilbert (1987), there are two possible solutions to this problem: remove the seasonal variation before applying the test, or use a test that is unaffected by seasonal variation. If the data set is large enough, the seasonal variation could be removed by examining data collected in the same month over several years. For example, El-Shaarawi, Esterby and Kuntz (1983) examined trends in water quality parameters measured in the Niagara River using the Spearman rank correlation coefficient applied to monthly means, one month at a time. Six years of data were available. An alternative approach is to use a test such as the seasonal Kendall test, described by Gilbert (1987) and examined in detail by Hirsch, Slack and Smith (1982), that is able to handle seasonal variations.

The Spearman rank correlation coefficient can be a useful tool for exploratory data analysis. Potential applications are numerous. We have used this technique to establish trends in soil lead concentrations as a function of distance from the home (paint lead), nearby roadways (automobile emissions) and a distant air emission source to help allocate source contributions. Each source should be associated with a unique footprint. The technique can also be used to evaluate ground water contamination data to help identify plumes when the data appear to be random. Concentrations in a single well can also be evaluated as a function of time to show whether concentrations are increasing or decreasing with time.

References
Berthouex, P.M. and Brown, L.C. 1994. Statistics for Environmental Engineers. Boca Raton, FL, CRC Press.
El-Shaarawi, A.H., Esterby, S.R. and Kuntz, K.W. 1983. A statistical evaluation of trends in the water quality of the Niagara River. J. Great Lakes Res. 9, 234–240.
Gilbert, R.O. 1987. Statistical Methods for Environmental Pollution Monitoring. New York, NY, Van Nostrand Reinhold.
Hirsch, R.M., Slack, J.R. and Smith, R.A. 1982. Techniques of trend analysis for monthly water quality data. Water Resources Research 18(1), 107–121.
Montgomery, R.H. and Reckhow, K.H. 1984. Techniques for detecting trends in lake water quality. Water Resources Bulletin 20(1), 43–52.
Robb, J. and Moyer, E. 2001. Natural attenuation of benzene and MTBE at four midwestern retail gasoline marketing outlets. Contaminated Soil Sediment and Water. Spring 2001 Special Issue, 64–71.
Zar, J.H. 1984. Biostatistical Analysis. Second Edition. Englewood Cliffs, NJ, Prentice Hall.